Abstract: As humans, we naturally break a task down into individual steps in our daily lives, and we can give feedback or dynamically adjust the plan when we encounter obstacles. Similarly, our aim is to help agents comprehend and carry out natural language instructions more efficiently and cost-effectively. For example, in Vision-Language Navigation (VLN) tasks, the agent needs to understand instructions such as "go to the table by the fridge". This understanding allows the agent to navigate to the table and infer that the destination is likely to be in the kitchen. Traditional VLN approaches mainly train models on large labeled datasets so that they can plan tasks in unseen environments, but manual labeling makes this costly. Because large language models (LLMs) acquire extensive commonsense knowledge during pre-training, some researchers have started using them as decision modules in embodied tasks. This approach demonstrates that LLMs can reason out a logical sequence of subtasks from global information. However, executing those subtasks often runs into problems, such as obstacles that hinder progress and changes in the state of the target object. Even one mistake can cause the subsequent subtasks to fail, which makes it difficult to complete an instruction with a single plan. Therefore, we propose a new approach, CPMI (Correction and Planning with Memory Integration), that is centered on an LLM for embodied tasks. In more detail, the auxiliary modules of CPMI facilitate dynamic planning by the LLM-centric planner. These modules provide the agent with memory and generalized experience mechanisms so that it can fully utilize the LLM's capabilities and improve its performance during execution.
Finally, experimental results on public datasets show that our approach achieves the best performance in the few-shot setting, improving the efficiency of successive tasks while increasing the success rate.
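
The closed loop described in the abstract (plan, execute, record the failure, re-plan) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' CPMI implementation: `Memory`, `stub_planner`, `execute`, and `run_episode` are hypothetical names, the planner is a hard-coded stand-in for an LLM call, and the environment is reduced to a set of blocked steps.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class Memory:
    """Hypothetical memory module: stores failed steps and the reason they failed."""
    failures: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, step: str, reason: str) -> None:
        self.failures.append((step, reason))

    def failed_steps(self) -> Set[str]:
        return {step for step, _ in self.failures}


def stub_planner(instruction: str, memory: Memory) -> List[str]:
    """Stand-in for an LLM planner (assumption): returns a fixed plan and
    swaps in an alternative for any step that memory marks as failed."""
    base_plan = [
        "leave hallway",
        "go through door A",
        "enter kitchen",
        "approach table by fridge",
    ]
    alternatives = {"go through door A": "go through door B"}
    failed = memory.failed_steps()
    return [alternatives.get(s, s) if s in failed else s for s in base_plan]


def execute(step: str, blocked: Set[str]) -> Tuple[bool, str]:
    """Toy executor: a step fails only if the environment blocks it."""
    if step in blocked:
        return False, "obstacle"
    return True, ""


def run_episode(
    instruction: str,
    planner: Callable[[str, Memory], List[str]],
    blocked: Set[str],
    max_replans: int = 3,
) -> Tuple[bool, List[str], Memory]:
    """Plan, execute, and re-plan on failure, up to max_replans corrections."""
    memory = Memory()
    done: List[str] = []
    for _ in range(max_replans + 1):
        plan = planner(instruction, memory)
        failed = False
        for step in plan:
            if step in done:
                continue  # already completed under an earlier plan
            ok, reason = execute(step, blocked)
            if not ok:
                memory.record(step, reason)  # feed the failure back to the planner
                failed = True
                break
            done.append(step)
        if not failed:
            return True, done, memory
    return False, done, memory


if __name__ == "__main__":
    success, steps, memory = run_episode(
        "go to the table by the fridge", stub_planner, {"go through door A"}
    )
    print(success, steps, memory.failures)
```

In this toy setup a single plan would fail at the blocked door, while the loop recovers after one correction because the recorded failure changes the next plan. A real system would replace `stub_planner` with an LLM prompt that includes the instruction and the memory contents.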

LLM as A Robotic Brain: Unifying Egocentric Memory and Control

CLFR-M: Continual Learning Framework for Robots Via Human Feedback and Dynamic Memory

Robots Can Multitask Too: Integrating a Memory Architecture and LLMs for Enhanced Cross-Task Robot Action Generation

Large Language Models for Robotics: Opportunities, Challenges, and Perspectives

Decision-Making in Robotic Grasping with Large Language Models

Chat with the Environment: Interactive Multimodal Perception Using Large Language Models

MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments

MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics?

Leave It to Large Language Models! Correction and Planning with Memory Integration

ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning

MHRC: Closed-loop Decentralized Multi-Heterogeneous Robot Collaboration with Large Language Models

Understanding Large-Language Model (LLM)-powered Human-Robot Interaction

When Robots Get Chatty: Grounding Multimodal Human-Robot Conversation and Collaboration

Large Language Models for Robotics: A Survey

LLM-Based Human-Robot Collaboration Framework for Manipulation Tasks

LaMI: Large Language Models for Multi-Modal Human-Robot Interaction

Prompt, Plan, Perform: LLM-based Humanoid Control via Quantized Imitation Learning

A little less conversation, a little more action, please: Investigating the physical common-sense of LLMs in a 3D embodied environment

Lifelong Robot Library Learning: Bootstrapping Composable and Generalizable Skills for Embodied Control with Language Models

Empowering Working Memory for Large Language Model Agents