Abstract: The convergence of embodied agents and large language models (LLMs) has brought significant advancements to embodied instruction following. In particular, the strong reasoning capabilities of LLMs make it possible for robots to perform long-horizon tasks without expensive annotated demonstrations. However, public benchmarks for testing the long-horizon reasoning capabilities of language-conditioned robots in various scenarios are still missing. To fill this gap, this work focuses on the tabletop manipulation task and releases a simulation benchmark, LoHoRavens, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetic, and reference. Furthermore, long-horizon manipulation with LLMs faces a key modality bridging problem that prior work has studied little: how to incorporate observation feedback during robot execution into the LLM's closed-loop planning. We investigate two methods of bridging the modality gap: caption generation and a learnable interface, which incorporate explicit and implicit observation feedback into the LLM, respectively. These methods serve as the two baselines for our proposed benchmark. Experiments show that both methods struggle to solve some tasks, indicating that long-horizon manipulation tasks are still challenging for current popular models. We expect that the proposed public benchmark and baselines can help the community develop better models for long-horizon tabletop manipulation tasks.

What problem does this paper attempt to address?

### The Problem Addressed by the Paper

This paper addresses the lack of long-horizon, language-conditioned benchmarks for robotic tabletop manipulation. Specifically, the authors develop a public benchmark that evaluates a robot's reasoning capabilities when performing complex, long-horizon tasks. Although large language models (LLMs) have made significant progress in enabling robots to understand high-level instructions, no comprehensive benchmark currently tests these models' long-horizon reasoning abilities across varied scenarios.

#### The Main Issues Include:

1. **Reasoning Ability for Long-Horizon Tasks**: Most current benchmarks either do not focus on long-horizon tasks or are not language-conditioned. Long-horizon tasks typically require a robot to carry out multiple steps and to reason about color, size, spatial location, arithmetic, and references.
2. **Modality Bridging Problem**: How to feed observation feedback gathered during execution into the closed-loop planning of an LLM remains under-explored. Doing so requires converting visual observations into language or into implicit representations that the LLM can consume.

#### Solutions:

- **LoHoRavens Benchmark**: The authors propose a simulated benchmark called LoHoRavens, designed to test the long-horizon, language-conditioned reasoning abilities of robots in tabletop manipulation tasks. The benchmark includes 10 long-horizon tasks, divided into seen and unseen tasks, to evaluate generalization.
- **Modality Bridging Methods**: To address the modality bridging problem, the authors investigate two methods:
  - **Explicit Feedback**: Natural-language captions describe the observed state and whether the last action succeeded, which supports the LLM's closed-loop planning.
  - **Implicit Feedback**: A multi-layer perceptron (MLP) is trained to convert visual embeddings into token embeddings that the LLM can accept.

#### Experimental Results:

- Even the most advanced models perform quite poorly on the LoHoRavens benchmark, indicating that long-horizon, language-conditioned tasks remain challenging.
- The explicit and implicit feedback methods each have advantages on different tasks, but neither solves all tasks.
- Performance drops significantly on unseen tasks, although the explicit feedback method is more robust there.

In summary, the LoHoRavens benchmark fills the gap in long-horizon, language-conditioned task benchmarking and provides a tool and baselines for future research.
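The two feedback routes described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: every function name, the toy block world, the simulated first-attempt failure, and the MLP shape (one hidden layer with ReLU, random untrained weights) are assumptions chosen only to make the ideas concrete. The first part shows explicit feedback as a loop in which a caption of the latest observation is appended to the planner's history; the second shows implicit feedback as a projection from a visual embedding size to a token embedding size.

```python
import random

def closed_loop(instruction, plan_step, execute, caption, max_steps=10):
    """Explicit feedback: plan one step, act, caption the result, repeat."""
    history = []  # list of (step, caption) pairs that the planner can read
    for _ in range(max_steps):
        step = plan_step(instruction, history)
        if step == "done":
            break
        observation = execute(step)
        history.append((step, caption(observation)))
    return history

def run_toy_episode():
    """Hypothetical block world; the first action fails to mimic a slip."""
    on_table = ["red block", "blue block"]
    attempts = {"n": 0}

    def execute(step):
        attempts["n"] += 1
        target = step[len("put "):-len(" in bowl")]
        if attempts["n"] != 1:  # simulated failure on the first attempt only
            on_table.remove(target)
        return list(on_table)

    def caption(observation):
        # Stand-in for a captioning model: describe what is still on the table.
        if not observation:
            return "table is empty"
        return "on table: " + ", ".join(observation)

    def plan_step(instruction, history):
        # Stand-in for the LLM planner: it reads only the latest caption.
        if not history:
            return "put red block in bowl"
        last_caption = history[-1][1]
        if last_caption == "table is empty":
            return "done"
        next_block = last_caption[len("on table: "):].split(", ")[0]
        return f"put {next_block} in bowl"

    return closed_loop("put all blocks in the bowl", plan_step, execute, caption)

def make_mlp(vis_dim, hidden_dim, tok_dim, seed=0):
    """Random, untrained weights; all sizes are illustrative assumptions."""
    rng = random.Random(seed)

    def init(rows, cols):
        return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

    return {"w1": init(vis_dim, hidden_dim), "b1": [0.0] * hidden_dim,
            "w2": init(hidden_dim, tok_dim), "b2": [0.0] * tok_dim}

def linear(x, w, b):
    """Compute x @ w + b for one input vector using plain lists."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def project_visual(params, vis_emb):
    """Implicit feedback: map each visual embedding to the token embedding size."""
    tokens = []
    for patch in vis_emb:
        hidden = [max(v, 0.0) for v in linear(patch, params["w1"], params["b1"])]
        tokens.append(linear(hidden, params["w2"], params["b2"]))
    return tokens
```

In the toy episode, the caption after the failed first attempt still lists the red block, so the planner retries it before moving on. An open-loop plan that ignores observations would have no way to notice the failure.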

LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation
