Abstract: Humans naturally break a task down into individual steps in daily life, and can give feedback or dynamically adjust the plan when they encounter obstacles. Similarly, our aim is to help agents comprehend and carry out natural language instructions more efficiently and cost-effectively. For example, in Vision-Language Navigation (VLN) tasks, the agent needs to understand instructions such as "go to the table by the fridge". Understanding this instruction allows the agent to navigate to the table and to infer that the destination is likely to be in the kitchen. The traditional VLN approach mainly trains models on large labeled datasets so that they can plan tasks in unseen environments. However, manual labeling makes this approach costly. Because large language models (LLMs) acquire extensive commonsense knowledge during pre-training, some researchers have started using LLMs as decision modules in embodied tasks. This approach demonstrates the LLMs' ability to reason and plan a logical sequence of subtasks based on global information. However, executing the subtasks often runs into problems, such as obstacles that hinder progress and changes in the state of the target object. Even one mistake can cause the subsequent tasks to fail, which makes it difficult to complete an instruction with a single plan. Therefore, we propose a new approach, CPMI (Correction and Planning with Memory Integration), that is centered on an LLM for embodied tasks. In more detail, the auxiliary modules of CPMI facilitate dynamic planning by the LLM-centric planner. These modules provide the agent with memory and generalized experience mechanisms to make full use of the LLM's capabilities, allowing it to improve its performance during execution. Finally, experimental results on public datasets demonstrate that we achieve the best performance in the few-shot scenario, improving the efficiency of successive tasks while increasing the success rate.
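
The abstract describes a planner that re-plans when a subtask fails, drawing on a memory of past failures. A minimal Python sketch of that loop follows. It is an illustration under stated assumptions, not the paper's implementation: `Memory`, `run_with_correction`, `toy_planner`, and `make_toy_executor` are all hypothetical names, and the toy planner is a hand-written stand-in for an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Memory:
    """Hypothetical memory module: stores (subtask, feedback) for failures."""
    failures: List[Tuple[str, str]] = field(default_factory=list)

    def remember(self, subtask: str, feedback: str) -> None:
        self.failures.append((subtask, feedback))


Planner = Callable[[str, List[str], Memory], List[str]]
Executor = Callable[[str], Tuple[bool, str]]


def run_with_correction(instruction: str, planner: Planner,
                        executor: Executor, max_replans: int = 3):
    """Execute a plan step by step; on failure, record feedback and re-plan."""
    memory = Memory()
    completed: List[str] = []
    plan = planner(instruction, completed, memory)
    replans = 0
    while plan:
        subtask = plan.pop(0)
        ok, feedback = executor(subtask)
        if ok:
            completed.append(subtask)
            continue
        # Correction step: store the failure, then ask for a new plan.
        memory.remember(subtask, feedback)
        if replans >= max_replans:
            return False, completed
        replans += 1
        plan = planner(instruction, completed, memory)
    return True, completed


def toy_planner(instruction: str, completed: List[str],
                memory: Memory) -> List[str]:
    """Stand-in for an LLM planner (assumption: fixed steps, one repair rule)."""
    steps = ["go to kitchen", "find fridge", "go to table"]
    if any(fb == "door closed" for _, fb in memory.failures):
        steps.insert(0, "open door")
    return [s for s in steps if s not in completed]


def make_toy_executor() -> Executor:
    """Toy environment: the kitchen is unreachable until the door is opened."""
    state = {"door_open": False}

    def execute(subtask: str) -> Tuple[bool, str]:
        if subtask == "open door":
            state["door_open"] = True
            return True, ""
        if subtask == "go to kitchen" and not state["door_open"]:
            return False, "door closed"
        return True, ""

    return execute


if __name__ == "__main__":
    print(run_with_correction("go to the table by the fridge",
                              toy_planner, make_toy_executor()))
```

In this toy run the first attempt at "go to kitchen" fails, the feedback is stored, and the second plan inserts "open door" before retrying. A single fixed plan would have failed at the first step.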

RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents

RoboMP$^2$: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models

Retrieval-Augmented Personalization for Multimodal Large Language Models

Reasoning with Language Model is Planning with World Model

Improving Planning with Large Language Models: A Modular Agentic Architecture

Retrieval-Augmented Hierarchical in-Context Reinforcement Learning and Hindsight Modular Reflections for Task Planning with LLMs

EmbodiedRAG: Dynamic 3D Scene Graph Retrieval for Efficient and Scalable Robot Task Planning

Leave It to Large Language Models! Correction and Planning with Memory Integration

From LLM to Conversational Agent: A Memory Enhanced Architecture with Fine-Tuning of Large Language Models

ConceptAgent: LLM-Driven Precondition Grounding and Tree Search for Robust Task Planning and Execution

Multi-agent Planning using Visual Language Models

LLM-SAP: Large Language Models Situational Awareness Based Planning

REAPER: Reasoning based Retrieval Planning for Complex RAG Systems

Open-vocabulary Queryable Scene Representations for Real World Planning

ReasonPlanner: Enhancing Autonomous Planning in Dynamic Environments with Temporal Knowledge Graphs and LLMs

MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

Ask-before-Plan: Proactive Language Agents for Real-World Planning

CLMASP: Coupling Large Language Models with Answer Set Programming for Robotic Task Planning

Inner Monologue: Embodied Reasoning through Planning with Language Models