EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, Ping Luo

DOI: https://doi.org/10.48550/arXiv.2305.15021

IF: 3.7

2023-05-24

Robotics

Abstract: Embodied AI is a crucial frontier in robotics, capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments. In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities. To achieve this, we have made the following efforts: (i) We craft a large-scale embodied planning dataset, termed EgoCOT. The dataset consists of carefully selected videos from the Ego4D dataset, along with corresponding high-quality language instructions. Specifically, we generate a sequence of sub-goals with the "Chain of Thoughts" mode for effective embodied planning. (ii) We introduce an efficient training approach to EmbodiedGPT for high-quality plan generation, by adapting a 7B large language model (LLM) to the EgoCOT dataset via prefix tuning. (iii) We introduce a paradigm for extracting task-related features from LLM-generated planning queries to form a closed loop between high-level planning and low-level control. Extensive experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering. Notably, EmbodiedGPT significantly enhances the success rate of the embodied control task by extracting more effective features. It has achieved a remarkable 1.6 times increase in success rate on the Franka Kitchen benchmark and a 1.3 times increase on the Meta-World benchmark, compared to the BLIP-2 baseline fine-tuned with the Ego4D dataset.

What problem does this paper attempt to address?

The paper addresses the problem of robots carrying out long-horizon tasks in physical environments. Specifically, it seeks to enable embodied agents to perceive, reason, and act in real-world settings by giving them multimodal understanding and execution capabilities. To this end, the authors make three contributions:

1. **Constructing a large-scale embodied planning dataset (EgoCOT)**: The dataset consists of carefully selected videos from the Ego4D dataset, paired with high-quality language instructions. A sequence of sub-goals is generated in a "chain of thought" style to support effective embodied planning.
2. **Proposing an efficient training method**: A 7-billion-parameter large language model (LLM) is adapted to the EgoCOT dataset via prefix tuning so that it generates high-quality plans.
3. **Introducing a feature-extraction paradigm**: Task-relevant features are extracted from LLM-generated planning queries, forming a closed loop between high-level planning and low-level control.

Experiments show that EmbodiedGPT is effective across embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering. Notably, it achieves a 1.6 times increase in success rate on the Franka Kitchen benchmark and a 1.3 times increase on the Meta-World benchmark, compared to the BLIP-2 baseline fine-tuned with the Ego4D dataset. The abstract attributes this gain on embodied control to the extraction of more effective features.
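The prefix tuning step mentioned above can be illustrated with a toy sketch. This is not the authors' implementation: it only shows the core idea that a small set of prepended vectors is trained while the base model stays frozen, in a simplified form where the prefix is prepended to the input sequence only. The class name `PrefixTunedToyModel`, the mean-pool-plus-linear "model", and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PrefixTunedToyModel:
    """Hypothetical toy model: mean-pool the sequence, then apply a frozen linear readout."""

    frozen_weights: List[float]   # stands in for the frozen LLM; never updated
    prefix: List[List[float]]     # trainable prefix vectors, shape (prefix_len, dim)

    def forward(self, token_embeddings: List[List[float]]) -> float:
        # Prepend the trainable prefix to the (fixed) input embeddings.
        sequence = self.prefix + token_embeddings
        dim = len(self.frozen_weights)
        pooled = [sum(vec[d] for vec in sequence) / len(sequence) for d in range(dim)]
        return sum(w * p for w, p in zip(self.frozen_weights, pooled))

    def train_step(self, token_embeddings: List[List[float]], target: float,
                   lr: float = 0.1) -> float:
        """One gradient step on the squared error, updating the prefix only."""
        seq_len = len(self.prefix) + len(token_embeddings)
        error = self.forward(token_embeddings) - target
        # For loss = 0.5 * error**2, d(loss)/d(prefix[i][d]) = error * w[d] / seq_len.
        for vec in self.prefix:
            for d, w in enumerate(self.frozen_weights):
                vec[d] -= lr * error * w / seq_len
        return 0.5 * error ** 2


if __name__ == "__main__":
    model = PrefixTunedToyModel(frozen_weights=[1.0, -2.0],
                                prefix=[[0.0, 0.0], [0.0, 0.0]])
    tokens = [[0.5, 0.1], [0.2, 0.3]]
    for _ in range(50):
        loss = model.train_step(tokens, target=1.0)
    print("final loss:", loss)
    print("frozen weights unchanged:", model.frozen_weights)
```

In this sketch only the four prefix values change during training, which is why the approach is cheap compared with updating every weight of the base model.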

Towards End-to-End Embodied Decision Making via Multi-modal Large Language Model: Explorations with GPT4-Vision and Beyond

AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation

GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration

Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion

KGGPT: Empowering Robots with OpenAI's ChatGPT and Knowledge Graph

Embodied Task Planning with Large Language Models

Octopus: Embodied Vision-Language Programmer from Environmental Feedback

Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning

Large Language Models for Robotics: Opportunities, Challenges, and Perspectives

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

Egocentric Planning for Scalable Embodied Task Achievement

VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI

Egocentric Vision Language Planning

EMPOWER: Embodied Multi-role Open-vocabulary Planning with Online Grounding and Execution

An Embodied Generalist Agent in 3D World

Core Challenges in Embodied Vision-Language Planning

PandaGPT: One Model to Instruction-Follow Them All

A Survey on Vision-Language-Action Models for Embodied AI