Abstract: In recent years, robotics has advanced significantly through the integration of larger models and large-scale datasets. However, challenges remain in applying these models to 3D spatial interactions and in managing data collection costs. To address these issues, we propose RoboMM, a multimodal robotic manipulation model, along with RoboData, a comprehensive dataset. RoboMM enhances 3D perception through camera parameters and occupancy supervision. Building on OpenFlamingo, it incorporates a Modality-Isolation-Mask and multimodal decoder blocks, improving modality fusion and fine-grained perception. RoboData offers a complete evaluation system by integrating several well-known datasets, achieving the first fusion of multi-view images, camera parameters, depth maps, and actions; its space alignment facilitates comprehensive learning from diverse robotic datasets. Equipped with RoboData and the unified physical space, RoboMM is a generalist policy that enables simultaneous evaluation across all tasks within multiple datasets, rather than focusing on a limited selection of data or tasks. Its design significantly enhances robotic manipulation performance, increasing the average sequence length on the CALVIN benchmark from 1.7 to 3.3 and ensuring cross-embodiment capabilities, achieving state-of-the-art results across multiple datasets.
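To make the role of camera parameters concrete, the following is a minimal sketch of how a depth pixel can be lifted into a shared world frame with a standard pinhole camera model. It is an illustration only, not RoboMM's actual pipeline: the function names `backproject` and `camera_to_world`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the camera-to-world convention for the extrinsics are all assumptions introduced here.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) with metric depth into the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p_cam: Vec3,
                    rotation: Sequence[Sequence[float]],
                    translation: Sequence[float]) -> Vec3:
    """Apply p_world = R @ p_cam + t, assuming camera-to-world extrinsics (hypothetical convention)."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Example: the pixel at the principal point, 2 m away, seen by a camera
# that is shifted by (1, 2, 3) with no rotation relative to the world frame.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = backproject(320, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_world = camera_to_world(p_cam, identity, [1.0, 2.0, 3.0])
print(p_world)
```

Once every view is expressed in one frame like this, observations from differently placed cameras become directly comparable, which is the intuition behind aligning heterogeneous datasets in a unified physical space.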

What problem does this paper attempt to address?

The paper addresses two main problems:

1. **Challenges in applying multimodal models to robot manipulation**: Current multimodal models mainly focus on understanding and generating 2D images, which limits their practical use in 3D physical space. Robots need to interact with 3D environments, and existing multimodal models have limited capabilities in this regard.
2. **Cost and efficiency of dataset construction**: Collecting large-scale robot datasets is costly and time-consuming. For example, collecting the approximately 130,000 segments of the RT-1 dataset took 17 months. Integrating existing datasets from multiple platforms and various robots therefore becomes particularly important for reducing the cost and time of data collection.

To address these challenges, the paper makes two main contributions:

1. **RoboMM**: A large multimodal model designed specifically for robot manipulation. RoboMM enhances 3D environmental perception by combining camera parameters with occupancy supervision, and introduces the Modality-Isolation-Mask (MIM) mechanism, which improves the flexibility of modality fusion and fine-grained perception.
2. **RoboData**: A comprehensive dataset that integrates datasets from multiple platforms and various robots, including CALVIN, Meta-World, LIBERO, Robomimic, RoboCAS, ManiSkill2, RoboCasa, RLBench, and Colosseum. RoboData resolves data heterogeneity and inconsistency by unifying the input and output spaces, enabling the model to learn effectively from diverse robot datasets.

Together, RoboMM and RoboData significantly improve performance on robot manipulation tasks, especially generalization across multiple datasets and cross-platform evaluation. Specifically, the average sequence length of RoboMM on the CALVIN dataset increased from 1.7 to 3.3, and the model achieved state-of-the-art results on multiple datasets.
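The average sequence length quoted above can be read as follows. This is a minimal sketch, assuming the metric follows the usual CALVIN long-horizon protocol, in which each rollout chains five language instructions and the score is the number of tasks completed in a row before the first failure. The function names and the example rollouts are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Sequence


def completed_prefix_length(results: Sequence[bool]) -> int:
    """Count tasks solved consecutively from the start of one rollout."""
    count = 0
    for succeeded in results:
        if not succeeded:
            break  # later tasks do not count once one task in the chain fails
        count += 1
    return count


def average_sequence_length(rollouts: Iterable[Sequence[bool]]) -> float:
    """Mean number of consecutively completed tasks over all rollouts."""
    lengths = [completed_prefix_length(r) for r in rollouts]
    if not lengths:
        raise ValueError("at least one rollout is required")
    return sum(lengths) / len(lengths)


# Four made-up rollouts of five chained tasks each (True = task succeeded).
example_rollouts = [
    [True, True, True, False, False],   # 3 tasks in a row
    [True, False, True, True, True],    # 1: successes after a failure are ignored
    [True, True, True, True, True],     # 5
    [False, False, False, False, False] # 0
]
print(average_sequence_length(example_rollouts))  # 2.25
```

Under this reading, the maximum possible score is 5, so a move from 1.7 to 3.3 means the policy completes on average about one and a half more chained tasks per rollout before its first failure.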

RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation

MoreFusion: Multi-object Reasoning for 6D Pose Estimation from Volumetric Fusion

RoboLLM: Robotic Vision Tasks Grounded on Multimodal Large Language Models

SM$^3$: Self-Supervised Multi-task Modeling with Multi-view 2D Images for Articulated Objects

Decision-Making in Robotic Grasping with Large Language Models

ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models

ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation

Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning

RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

RoboMP$^2$: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models

MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics?

GAMMA: Generalizable Articulation Modeling and Manipulation for Articulated Objects

RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation

Open-vocabulary Mobile Manipulation in Unseen Dynamic Environments with 3D Semantic Maps

Multiple Interactions Made Easy (MIME): Large Scale Demonstrations Data for Imitation

A Comprehensive Study of 3-D Vision-Based Robot Manipulation

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Open-World Object Manipulation using Pre-trained Vision-Language Models

Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets

A Multi-modal Framework for Robots to Learn Manipulation Tasks from Human Demonstrations