Abstract: Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension capabilities to formulate precise and efficient action plans based on natural language instructions. However, for embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. This study provides a comprehensive overview of the emerging integration of LLMs and multimodal LLMs into various robotic tasks. Additionally, we propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions. Our results, based on diverse datasets, indicate that GPT-4V effectively enhances robot performance in embodied tasks. This extensive survey and evaluation of LLMs and multimodal LLMs across a variety of robotic tasks enriches the understanding of LLM-centric embodied intelligence and provides forward-looking insights toward bridging the gap in Human-Robot-Environment interaction.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper explores the application of large language models (LLMs) in robotic task planning, aiming to address the following key issues:

1. **Integration of Text and Physical Environment**:
   - **Problem Description**: Existing text-based LLMs often face challenges when dealing with tasks that require robots to interact with complex environments due to a lack of compatibility with visual perception.
   - **Solution**: The paper proposes a framework that utilizes multimodal GPT-4V, combining natural language instructions with robotic visual perception to enhance the performance of robots in specific tasks.
2. **Multimodal Task Planning**:
   - **Problem Description**: Traditional LLMs mainly focus on text understanding and generation, while robotic tasks usually require handling data from multiple modalities (such as text, images, sounds, etc.).
   - **Solution**: By integrating multimodal information, the paper evaluates the effectiveness of GPT-4V in different environments and scenarios, demonstrating its potential in multimodal task planning.
3. **Enhancement of Human-Robot Interaction**:
   - **Problem Description**: Current human-robot interaction technologies often struggle to achieve efficient and natural communication when handling complex tasks.
   - **Solution**: Through comprehensive research and experiments, the paper provides insights on how to utilize LLMs to improve human-robot interaction, particularly by guiding robot behavior through natural language instructions.
4. **Training of General Robotic Strategies**:
   - **Problem Description**: Robots need to possess a certain level of generality and adaptability when performing tasks, but existing methods have limitations in this aspect.
   - **Solution**: The paper summarizes the technical methods of LLMs in the field of robotics, explores the potential of training general robotic strategies, and provides foundational research for researchers.

### Summary

The core issue of the paper is to explore how to overcome the challenges faced by robotic technology and leverage the achievements of LLMs in other fields to advance robotic technology. Specifically, through comprehensive research, technical evaluation, and experimental validation, the paper proposes methods to enhance robotic task planning using multimodal GPT-4V, aiming to achieve more efficient and natural human-robot interaction.

Large Language Models for Robotics: Opportunities, Challenges, and Perspectives

Large Language Models for Robotics: A Survey

A Survey on Integration of Large Language Models with Intelligent Robots

Decision-Making in Robotic Grasping with Large Language Models

Enhancing Robot Task Planning and Execution through Multi-Layer Large Language Models

MHRC: Closed-loop Decentralized Multi-Heterogeneous Robot Collaboration with Large Language Models

Embodied AI in Mobile Robots: Coverage Path Planning with Large Language Models

Embodied intelligence in manufacturing: leveraging large language models for autonomous industrial robotics

Understanding Large-Language Model (LLM)-powered Human-Robot Interaction

Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning

Towards Human Awareness in Robot Task Planning with Large Language Models

Integration of LLMs and the Physical World: Research and Application

When Robots Get Chatty: Grounding Multimodal Human-Robot Conversation and Collaboration

LaMI: Large Language Models for Multi-Modal Human-Robot Interaction

Interpreting and learning voice commands with a Large Language Model for a robot system

Long-horizon Locomotion and Manipulation on a Quadrupedal Robot with Large Language Models

Large Language Models for Orchestrating Bimanual Robots

MLDT: Multi-Level Decomposition for Complex Long-Horizon Robotic Task Planning with Open-Source Large Language Model

Evaluating Large Language Models with RAG Capability: A Perspective from Robot Behavior Planning and Execution

TPTU: Large Language Model-based AI Agents for Task Planning and Tool Usage