Abstract: Humans naturally break a task down into individual steps in daily life, and we give feedback or adjust the plan dynamically when we encounter obstacles. Our aim is likewise to help agents comprehend and carry out natural language instructions more efficiently and cost-effectively. For example, in Vision-Language Navigation (VLN) tasks, the agent needs to understand instructions such as "go to the table by the fridge". Understanding this instruction lets the agent navigate to the table and infer that the destination is likely to be in the kitchen. The traditional VLN approach mainly trains models on large labeled datasets for task planning in unseen environments, but manual labeling makes this approach costly. Because large language models (LLMs) already acquire extensive commonsense knowledge during pre-training, some researchers have started using LLMs as decision modules in embodied tasks. This approach shows that LLMs can reason well enough to plan a logical sequence of subtasks from global information. However, executing the subtasks often runs into problems, such as obstacles that hinder progress or changes in the state of the target object. Even one mistake can cause the subsequent tasks to fail, which makes it challenging to complete an instruction with a single plan. Therefore, we propose a new approach, C (Correction) and P (Planning) with M (Memory) I (Integration), or CPMI, that is centered on an LLM for embodied tasks. In more detail, the auxiliary modules of CPMI facilitate dynamic planning by the LLM-centric planner. These modules provide the agent with memory and generalized experience mechanisms that make full use of the LLM's capabilities, allowing it to improve its performance during execution. Finally, experimental results on public datasets demonstrate that we achieve the best performance in the few-shot scenario, improving the efficiency of successive tasks while increasing the success rate.
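
The plan, correct, and remember cycle described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the CPMI modules are built, so every name here (`Memory`, `run_instruction`, `toy_plan`, `toy_execute`, `toy_correct`) is hypothetical, the LLM planner is replaced by a hard-coded stub, and the "closed door" obstacle is an invented toy scenario. The sketch only shows the general idea of replanning around a failed subtask instead of relying on a single plan.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Memory:
    """Hypothetical memory: an execution log plus remembered corrections."""
    log: List[Tuple[str, bool]] = field(default_factory=list)
    experience: Dict[str, List[str]] = field(default_factory=dict)


def run_instruction(
    instruction: str,
    plan: Callable[[str], List[str]],
    correct: Callable[[str, List[str], Memory], List[str]],
    execute: Callable[[str], bool],
    memory: Memory,
    max_corrections: int = 3,
) -> bool:
    """Execute a plan subtask by subtask, replanning locally on failure."""
    queue = list(plan(instruction))
    corrections = 0
    while queue:
        subtask = queue.pop(0)
        ok = execute(subtask)
        memory.log.append((subtask, ok))
        if ok:
            continue
        if corrections >= max_corrections:
            return False  # give up after too many corrections
        corrections += 1
        # Reuse a remembered correction if one exists, else ask the corrector
        # and remember its proposal for later failures of the same subtask.
        fix = memory.experience.get(subtask)
        if fix is None:
            fix = correct(subtask, queue, memory)
            memory.experience[subtask] = fix
        queue = list(fix) + queue  # the fix replaces the failed subtask
    return True


# Toy environment (assumption): nothing succeeds until the door is opened.
world = {"door_open": False}


def toy_plan(instruction: str) -> List[str]:
    # Stand-in for an LLM planner; returns a fixed subtask sequence.
    return ["go to kitchen", "go to table by fridge"]


def toy_execute(subtask: str) -> bool:
    if subtask == "open door":
        world["door_open"] = True
        return True
    return world["door_open"]


def toy_correct(failed: str, remaining: List[str], memory: Memory) -> List[str]:
    # Stand-in for an LLM corrector; removes the obstacle, then retries.
    return ["open door", failed]


memory = Memory()
success = run_instruction(
    "go to the table by the fridge", toy_plan, toy_correct, toy_execute, memory
)
print(success, memory.log)
```

In this toy run the first subtask fails once, the corrector inserts an extra step, and the remaining plan then completes. A real system would replace the three stubs with LLM calls and an embodied environment.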

Task Navigator: Decomposing Complex Tasks for Multimodal Large Language Models

TaskLAMA: Probing the Complex Task Understanding of Language Models

UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

Navigating Complexity: Orchestrated Problem Solving with Multi-Agent LLMs

Enhancing Subtask Performance of Multi-modal Large Language Model

NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models

Small LLMs Are Weak Tool Learners: A Multi-LLM Agent

MLDT: Multi-Level Decomposition for Complex Long-Horizon Robotic Task Planning with Open-Source Large Language Model

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains

MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning

KnowledgeNavigator: Leveraging Large Language Models for Enhanced Reasoning over Knowledge Graph

Model Composition for Multimodal Large Language Models

Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

InfMLLM: A Unified Framework for Visual-Language Tasks

Leave It to Large Language Models! Correction and Planning with Memory Integration

ControlLLM: Augment Language Models with Tools by Searching on Graphs

Exploring Graph Structure Comprehension Ability of Multimodal Large Language Models: Case Studies

TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents