RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation

Feng Yan, Fanfan Liu, Liming Zheng, Yufeng Zhong, Yiyang Huang, Zechao Guan, Chengjian Feng, Lin Ma
2024-12-10
Abstract: In recent years, robotics has advanced significantly through the integration of larger models and large-scale datasets. However, challenges remain in applying these models to 3D spatial interactions and in managing data collection costs. To address these issues, we propose a multimodal robotic manipulation model, RoboMM, along with a comprehensive dataset, RoboData. RoboMM enhances 3D perception through camera parameters and occupancy supervision. Building on OpenFlamingo, it incorporates a Modality-Isolation-Mask and multimodal decoder blocks, improving modality fusion and fine-grained perception. RoboData offers a complete evaluation system by integrating several well-known datasets, achieving the first fusion of multi-view images, camera parameters, depth maps, and actions; its space alignment facilitates comprehensive learning from diverse robotic datasets. Equipped with RoboData and the unified physical space, RoboMM is a generalist policy that enables simultaneous evaluation across all tasks within multiple datasets, rather than focusing on a limited selection of data or tasks. Its design significantly enhances robotic manipulation performance, increasing the average sequence length on the CALVIN benchmark from 1.7 to 3.3 and ensuring cross-embodiment capabilities, achieving state-of-the-art results across multiple datasets.
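To make the role of camera parameters and depth maps in a shared physical space concrete, here is a minimal sketch of standard pinhole back-projection. It is not RoboMM's or RoboData's implementation: the function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the camera-to-world pose (`rotation`, `translation`) are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def pixel_to_camera(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Back-project pixel (u, v) with metric depth into the camera frame
    using the pinhole model (assumed here; lens distortion is ignored)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p_cam: Vec3,
                    rotation: Sequence[Sequence[float]],
                    translation: Sequence[float]) -> Vec3:
    """Apply a rigid camera-to-world transform: p_world = R @ p_cam + t."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical camera: identity orientation, placed 1 m along the world x-axis.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = pixel_to_camera(320.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(camera_to_world(p_cam, identity, [1.0, 0.0, 0.0]))  # (1.0, 0.0, 2.0)
```

Points from several views expressed in one world frame this way can be compared directly, which is the general idea behind aligning multi-view observations; how RoboData actually performs its space alignment is described in the paper itself.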
Robotics, Multimedia
What problem does this paper attempt to address?
The paper addresses two main problems:

1. **Limited 3D capability of multimodal models in robot manipulation**: Current multimodal models mainly focus on understanding and generating 2D images, which limits their practical use in 3D physical space. Robots must interact with 3D environments, and existing multimodal models are poorly equipped for this.
2. **Cost and efficiency of dataset construction**: Collecting large-scale robot datasets is costly and time-consuming. For example, it took 17 months to collect the approximately 130,000 segments of the RT-1 dataset. Integrating existing datasets from multiple platforms and robots is therefore an important way to reduce the cost and time of data collection.

To address these challenges, the paper makes two main contributions:

1. **RoboMM**: A large multimodal model designed specifically for robot manipulation. RoboMM strengthens 3D environmental perception by combining camera parameters with occupancy supervision, and it introduces the Modality-Isolation-Mask (MIM) mechanism, which makes modality fusion more flexible and improves fine-grained perception.
2. **RoboData**: A comprehensive dataset that integrates data from multiple platforms and robots, including CALVIN, Meta-World, LIBERO, Robomimic, RoboCAS, ManiSkill2, RoboCasa, RLBench, and Colosseum. RoboData resolves data heterogeneity and inconsistency by unifying the input and output spaces, so that the model can learn effectively from diverse robot datasets.

Together, RoboMM and RoboData improve performance on robot manipulation tasks, particularly generalization across multiple datasets and cross-platform evaluation. Specifically, the average sequence length of RoboMM on the CALVIN dataset increases from 1.7 to 3.3, and the model achieves state-of-the-art results on multiple datasets.
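For readers unfamiliar with the metric, the sketch below shows how an average sequence length can be computed. It assumes the usual CALVIN long-horizon protocol, in which each rollout is a chain of five language instructions and counting stops at the first failed task. The function names and the rollout outcomes are invented toy values, not results from the paper.

```python
from typing import Iterable, List


def completed_in_a_row(task_successes: Iterable[bool]) -> int:
    """Count tasks completed consecutively from the start of one chain;
    counting stops at the first failure."""
    count = 0
    for succeeded in task_successes:
        if not succeeded:
            break
        count += 1
    return count


def average_sequence_length(rollouts: List[List[bool]]) -> float:
    """Mean number of consecutively completed tasks over all rollouts."""
    if not rollouts:
        raise ValueError("at least one rollout is required")
    return sum(completed_in_a_row(r) for r in rollouts) / len(rollouts)


# Toy outcomes for four chains of five tasks each (illustrative only).
toy_rollouts = [
    [True, True, True, False, False],   # 3 in a row
    [True, False, True, True, True],    # 1 in a row; later successes do not count
    [True, True, True, True, True],     # 5 in a row
    [False, False, False, False, False],  # 0 in a row
]
print(average_sequence_length(toy_rollouts))  # 2.25
```

Under this definition the score ranges from 0 to 5, so the reported increase from 1.7 to 3.3 means the policy completes, on average, about 1.6 more consecutive tasks per chain.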