RoboMP$^2$: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models

Qi Lv, Hao Li, Xiang Deng, Rui Shao, Michael Yu Wang, Liqiang Nie
2024-06-08
Abstract: Multimodal Large Language Models (MLLMs) have shown impressive reasoning abilities and general intelligence in various domains. This has inspired researchers to train end-to-end MLLMs or to use large models to generate policies from human-selected prompts for embodied agents. However, these methods exhibit limited generalization capabilities on unseen tasks or scenarios, and overlook the multimodal environment information which is critical for robots to make decisions. In this paper, we introduce a novel Robotic Multimodal Perception-Planning (RoboMP$^2$) framework for robotic manipulation, which consists of a Goal-Conditioned Multimodal Preceptor (GCMP) and a Retrieval-Augmented Multimodal Planner (RAMP). Specifically, GCMP captures environment states by employing a tailored MLLM for embodied agents with the abilities of semantic reasoning and localization. RAMP utilizes a coarse-to-fine retrieval method to find the $k$ most relevant policies as in-context demonstrations to enhance the planner. Extensive experiments demonstrate the superiority of RoboMP$^2$ on both the VIMA benchmark and real-world tasks, with around 10% improvement over the baselines.
Robotics
What problem does this paper attempt to address?
The paper addresses two shortcomings of existing Multimodal Large Language Models (MLLMs) and perception-planning methods for robotic manipulation: they generalize poorly to unseen tasks or scenarios, and they overlook multimodal environment information that is crucial for robotic decision-making. Specifically, the paper points out:

1. **Limitations of Existing Methods**:
   - **End-to-End Models**: These models typically require closed-loop data for training. Such data is scarce in the real world, so the models overfit and perform poorly in unseen environments.
   - **Prompt-Based Methods**: These methods rely on manually designed and selected prompt templates to generate plans. They generalize poorly across tasks, especially when the task differs significantly from the examples in the prompt templates.
2. **Environmental Perception and Task Planning**:
   - **Environmental Perception**: Existing perception models (such as YOLOv5 and CLIP) perform well in simple scenarios but struggle to recognize and locate objects with complex spatial relationships in more complicated scenes.
   - **Task Planning**: Existing approaches fall mainly into end-to-end models and prompt-based methods, both of which perform poorly on unseen tasks.

To address these issues, the paper proposes a new robotic multimodal perception-planning framework (RoboMP$^2$), which consists of a Goal-Conditioned Multimodal Preceptor (GCMP) and a Retrieval-Augmented Multimodal Planner (RAMP). Its specific objectives are:

- **Improving Perception Capabilities**: By using a tailored MLLM, GCMP can understand and locate objects described by complex referential expressions.
- **Enhancing Planning Capabilities**: Through a retrieval-augmented approach, RAMP adaptively selects the $k$ most relevant policies as in-context demonstrations, thereby improving the generalization capability of planning.

In summary, the paper aims to enhance the perception and reasoning abilities of robots in unseen tasks and scenarios by fully leveraging multimodal information in the environment and the general intelligence of large models.
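
The coarse-to-fine retrieval idea behind RAMP can be illustrated with a minimal sketch. This is not the paper's implementation: the summary does not specify the similarity measures, so the sketch assumes a cheap token-overlap (Jaccard) filter for the coarse stage and a bag-of-words cosine rerank for the fine stage. The policy library, the policy strings, and the function names (`retrieve`, `build_prompt`) are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical policy library: (task instruction, policy code) pairs.
POLICY_LIBRARY = [
    ("put the red block into the green bowl",
     "pick('red block'); place('green bowl')"),
    ("rotate the blue block by 90 degrees",
     "rotate('blue block', 90)"),
    ("stack the red block on the blue block",
     "pick('red block'); place_on('blue block')"),
    ("sweep all small blocks into the zone",
     "sweep('small blocks', 'zone')"),
]


def tokenize(text):
    """Lowercase whitespace tokenization (an assumption, kept simple)."""
    return text.lower().split()


def jaccard(a, b):
    """Coarse score: overlap of the two token sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cosine(a, b):
    """Fine score: cosine similarity of bag-of-words count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def retrieve(query, library, coarse_n=3, k=2):
    """Return the k most relevant (instruction, policy) pairs."""
    q = tokenize(query)
    # Coarse stage: keep the coarse_n candidates with the highest set overlap.
    coarse = sorted(library,
                    key=lambda item: jaccard(q, tokenize(item[0])),
                    reverse=True)[:coarse_n]
    # Fine stage: rerank the survivors with the more discriminative score.
    fine = sorted(coarse,
                  key=lambda item: cosine(q, tokenize(item[0])),
                  reverse=True)
    return fine[:k]


def build_prompt(query, demos):
    """Format retrieved policies as in-context demonstrations for a planner."""
    parts = [f"# Task: {task}\n{policy}" for task, policy in demos]
    parts.append(f"# Task: {query}")
    return "\n\n".join(parts)


demos = retrieve("put the blue block into the red bowl", POLICY_LIBRARY)
prompt = build_prompt("put the blue block into the red bowl", demos)
```

The two-stage design keeps the expensive comparison off most of the library: the coarse filter discards clearly unrelated policies, and only the remaining candidates are reranked. A real system would likely replace both scores with learned embeddings, possibly over multimodal inputs, which this sketch does not attempt.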