Abstract: Multimodal Large Language Models (MLLMs) have shown impressive reasoning abilities and general intelligence in various domains. This has inspired researchers to train end-to-end MLLMs or to use large models to generate policies with human-selected prompts for embodied agents. However, these methods exhibit limited generalization capabilities on unseen tasks or scenarios, and overlook the multimodal environment information that is critical for robots to make decisions. In this paper, we introduce a novel Robotic Multimodal Perception-Planning (RoboMP$^2$) framework for robotic manipulation, which consists of a Goal-Conditioned Multimodal Preceptor (GCMP) and a Retrieval-Augmented Multimodal Planner (RAMP). Specifically, GCMP captures environment states by employing an MLLM tailored for embodied agents, with the abilities of semantic reasoning and localization. RAMP uses a coarse-to-fine retrieval method to find the $k$ most relevant policies as in-context demonstrations to enhance the planner. Extensive experiments demonstrate the superiority of RoboMP$^2$ on both the VIMA benchmark and real-world tasks, with around 10% improvement over the baselines.

What problem does this paper attempt to address?

The paper addresses two weaknesses of existing Multimodal Large Language Models (MLLMs) and perception-planning methods in robotic manipulation: they generalize poorly to unseen tasks or scenarios, and they overlook the multimodal environmental information that is crucial for robotic decision-making. Specifically, the paper points out:

1. **Limitations of Existing Methods**:
   - **End-to-End Models**: These models typically require closed-loop data for training. Such data is very limited in the real world, so the models overfit and perform poorly in unseen environments.
   - **Prompt-Based Methods**: These methods rely on manually designed and selected prompt templates to generate plans. They generalize poorly across tasks, especially when a task differs significantly from the examples in the prompt templates.
2. **Environmental Perception and Task Planning**:
   - **Environmental Perception**: Existing perception models (such as YOLOv5 and CLIP) perform well in simple scenarios but struggle to recognize and locate objects with complex spatial relationships in more complicated scenes.
   - **Task Planning**: Existing strategies fall mainly into end-to-end models and prompt-based methods, both of which perform poorly on unseen tasks.

To address these issues, the paper proposes a new robotic multimodal perception-planning framework (RoboMP$^2$), which consists of a Goal-Conditioned Multimodal Preceptor (GCMP) and a Retrieval-Augmented Multimodal Planner (RAMP). The specific objectives include:

- **Improving Perception Capabilities**: By using a customized MLLM, GCMP can understand and locate objects described by complex referential expressions.
- **Enhancing Planning Capabilities**: Through a retrieval-augmented approach, RAMP adaptively selects the most relevant policies as in-context examples, thereby improving the generalization capability of planning.

In summary, the paper aims to enhance the perception and reasoning abilities of robots in unseen tasks and scenarios by fully leveraging multimodal information in the environment and the general intelligence of large models.
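The retrieval idea behind RAMP can be illustrated with a small sketch. The summary above does not say how the coarse and fine stages are implemented, so everything below is an assumption made for illustration: the coarse stage is a cheap token-set overlap (Jaccard) filter, the fine stage reranks the shortlist by cosine similarity of term-frequency vectors, and the policy library, the `pick_place`/`rotate`/`sweep` policy strings, and all function names are hypothetical rather than taken from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical policy library: (instruction, policy code) pairs.
# The policy strings are illustrative only, not the paper's API.
POLICY_LIBRARY = [
    ("put the red block into the green bowl", "pick_place('red block', 'green bowl')"),
    ("rotate the blue cup by 90 degrees", "rotate('blue cup', 90)"),
    ("stack the yellow block on the red block", "pick_place('yellow block', 'red block')"),
    ("sweep the small objects into the box", "sweep('small objects', 'box')"),
    ("put the blue block into the red bowl", "pick_place('blue block', 'red bowl')"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a, b):
    """Coarse score (assumed): overlap between the two token sets."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Fine score (assumed): cosine similarity of term-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_demonstrations(query, library, k=2, shortlist_size=4):
    """Return the k library entries most relevant to the query."""
    q = tokenize(query)
    # Coarse stage: keep a shortlist using the cheap set-overlap score.
    shortlist = sorted(
        library, key=lambda item: jaccard(q, tokenize(item[0])), reverse=True
    )[:shortlist_size]
    # Fine stage: rerank only the shortlist with the finer-grained score.
    reranked = sorted(
        shortlist, key=lambda item: cosine(q, tokenize(item[0])), reverse=True
    )
    return reranked[:k]


def build_prompt(query, demos):
    """Format retrieved pairs as in-context demonstrations for a planner."""
    parts = [f"Instruction: {inst}\nPolicy: {policy}" for inst, policy in demos]
    parts.append(f"Instruction: {query}\nPolicy:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    query = "put the yellow block into the blue bowl"
    demos = retrieve_demonstrations(query, POLICY_LIBRARY, k=2)
    print(build_prompt(query, demos))
```

In a real system the two scoring functions would more plausibly be learned text or multimodal embeddings. The point of the sketch is only the two-stage shape: a cheap filter over the whole library, followed by a more careful ranking of a small shortlist whose top $k$ entries become the in-context demonstrations.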

RoboMP$^2$: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models
