Abstract: As humans, we naturally break a task down into individual steps in daily life, and we can give feedback or dynamically adjust the plan when we encounter obstacles. Similarly, our aim is to help agents comprehend and carry out natural language instructions more efficiently and cost-effectively. For example, in Vision-Language Navigation (VLN) tasks, the agent needs to understand instructions such as "go to the table by the fridge". Understanding this instruction allows the agent to navigate to the table and infer that the destination is likely to be in the kitchen. The traditional VLN approach mainly trains models on large labeled datasets so that they can plan tasks in unseen environments. However, manual labeling makes this approach costly. Because large language models (LLMs) acquire extensive commonsense knowledge during pre-training, some researchers have started using LLMs as decision modules in embodied tasks. This line of work shows that LLMs can reason out a logical sequence of subtasks from global information. However, executing subtasks often runs into problems, such as obstacles that hinder progress and changes in the state of the target object. Even one mistake can cause the subsequent subtasks to fail, which makes it difficult to complete an instruction with a single plan. Therefore, we propose a new approach, C (Correction) and P (Planning) with M (Memory) I (Integration), abbreviated CPMI, that is centered on an LLM for embodied tasks. In more detail, the auxiliary modules of CPMI facilitate dynamic planning by the LLM-centric planner. These modules provide the agent with memory and generalized experience mechanisms that make full use of the LLM's capabilities, allowing it to improve its performance during execution. Finally, experimental results on public datasets demonstrate that we achieve the best performance in the few-shot scenario, improving the efficiency of successive tasks while increasing the success rate.
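The plan, execute, correct cycle described in the abstract can be illustrated with a minimal sketch. This is not the CPMI implementation: the names `Memory`, `run_with_correction`, `toy_planner`, and `ToyEnv` are hypothetical, and a hand-written planner function stands in for the LLM so that the example runs without any model. It only shows the general idea of recording a failure in memory and replanning the remaining subtasks instead of relying on a single plan.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Memory:
    """Hypothetical memory module: keeps notes about failed subtasks."""
    notes: List[str] = field(default_factory=list)


def run_with_correction(
    instruction: str,
    plan_fn: Callable[[str, List[str], List[str]], List[str]],
    execute_fn: Callable[[str], Optional[str]],
    max_replans: int = 3,
) -> Tuple[bool, List[str], List[str]]:
    """Plan, execute, and replan on failure (illustrative only).

    plan_fn stands in for the LLM planner; it receives the instruction,
    the subtasks completed so far, and the failure notes held in memory.
    execute_fn returns None on success or an error message on failure.
    """
    memory = Memory()
    completed: List[str] = []
    for _ in range(max_replans + 1):
        plan = plan_fn(instruction, completed, memory.notes)
        failed = False
        for subtask in plan:
            error = execute_fn(subtask)
            if error is None:
                completed.append(subtask)
            else:
                # Record the failure so the next plan can correct for it.
                memory.notes.append(f"{subtask}: {error}")
                failed = True
                break
        if not failed:
            return True, completed, memory.notes
    return False, completed, memory.notes


def toy_planner(instruction: str, completed: List[str], notes: List[str]) -> List[str]:
    """Assumed stand-in for an LLM call: a fixed plan plus one correction rule."""
    steps = ["go to kitchen", "go to table"]
    if any("door is closed" in note for note in notes):
        steps.insert(0, "open door")
    return [step for step in steps if step not in completed]


class ToyEnv:
    """Assumed toy environment with a closed door blocking the kitchen."""

    def __init__(self) -> None:
        self.door_open = False

    def execute(self, subtask: str) -> Optional[str]:
        if subtask == "open door":
            self.door_open = True
            return None
        if subtask == "go to kitchen" and not self.door_open:
            return "door is closed"
        return None


env = ToyEnv()
success, completed, notes = run_with_correction(
    "go to the table by the fridge", toy_planner, env.execute
)
print(success, completed, notes)
```

In this toy run the first plan fails at the closed door, the failure is written to memory, and the second plan adds an "open door" subtask before continuing. In a real system the planner would be an LLM prompt that includes the memory contents, and the executor would be the embodied agent's low-level controller.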

LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments

Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making

ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models

Language Models Meet World Models: Embodied Experiences Enhance Language Models

LEGENT: Open Platform for Embodied Agents

Building Cooperative Embodied Agents Modularly with Large Language Models

clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents

LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models

Embodied Task Planning with Large Language Models

Inner Monologue: Embodied Reasoning through Planning with Language Models

OPEx: A Component-Wise Analysis of LLM-Centric Agents in Embodied Instruction Following

Grounding Large Language Models In Embodied Environment With Imperfect World Models

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems

Leave It to Large Language Models! Correction and Planning with Memory Integration

A little less conversation, a little more action, please: Investigating the physical common-sense of LLMs in a 3D embodied environment

Chat with the Environment: Interactive Multimodal Perception Using Large Language Models

MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments

On Grounded Planning for Embodied Tasks with Language Models

Large Language Models for Robotics: Opportunities, Challenges, and Perspectives