LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation

Shengqiang Zhang, Philipp Wicke, Lütfi Kerem Şenel, Luis Figueredo, Abdeldjallil Naceri, Sami Haddadin, Barbara Plank, Hinrich Schütze
2023-10-23
Abstract: The convergence of embodied agents and large language models (LLMs) has brought significant advancements to embodied instruction following. In particular, the strong reasoning capabilities of LLMs make it possible for robots to perform long-horizon tasks without expensive annotated demonstrations. However, public benchmarks for testing the long-horizon reasoning capabilities of language-conditioned robots in various scenarios are still missing. To fill this gap, this work focuses on the tabletop manipulation task and releases a simulation benchmark, LoHoRavens, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetic, and reference. Furthermore, there is a key modality bridging problem for long-horizon manipulation tasks with LLMs: how to incorporate observation feedback during robot execution into the LLM's closed-loop planning, which has received little attention in prior work. We investigate two methods of bridging the modality gap: caption generation and a learnable interface, which incorporate explicit and implicit observation feedback into the LLM, respectively. These methods serve as the two baselines for our proposed benchmark. Experiments show that both methods struggle to solve some tasks, indicating that long-horizon manipulation tasks are still challenging for current popular models. We expect that the proposed public benchmark and baselines can help the community develop better models for long-horizon tabletop manipulation tasks.
Robotics, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem Addressed by the Paper

This paper addresses the lack of a public, language-conditioned benchmark for long-horizon robotic tabletop manipulation. Specifically, the authors develop a benchmark that evaluates a robot's reasoning capabilities when it performs complex, multi-step tasks. Although large language models (LLMs) have made significant progress in enabling robots to understand high-level instructions, no comprehensive benchmark currently tests these models' long-horizon reasoning abilities across various scenarios.

#### The Main Issues Include:

1. **Reasoning Ability for Long-Horizon Tasks**: Most current benchmarks either do not focus on long-horizon tasks or are not language-conditioned. Long-horizon tasks typically require robots to perform multiple steps and to reason about color, size, spatial location, arithmetic, and reference.
2. **Modality Bridging Problem**: How to feed observations made during execution back into the LLM's closed-loop planning remains under-explored. Doing so requires converting visual observations into language or into implicit representations that the LLM can consume.

#### Solutions:

- **LoHoRavens Benchmark**: The authors propose a simulation benchmark called LoHoRavens, designed to test the long-horizon, language-conditioned reasoning abilities of robots in tabletop manipulation tasks. The benchmark includes 10 long-horizon tasks, divided into seen and unseen tasks, to evaluate how well models generalize.
- **Modality Bridging Methods**: To address the modality bridging problem, the authors investigate two methods:
  - **Explicit Feedback**: Natural-language feedback (generated captions) describes the observed state and whether the last action succeeded, which supports the LLM's closed-loop planning.
  - **Implicit Feedback**: A multi-layer perceptron (MLP) is trained to convert visual embeddings into token embeddings that the LLM can accept, providing feedback without an intermediate text description.

#### Experimental Results:

- Even state-of-the-art models perform poorly on the LoHoRavens benchmark, indicating that long-horizon language-conditioned tasks remain challenging.
- The explicit and implicit feedback methods each have advantages on different tasks, but neither solves all tasks.
- Performance drops significantly on unseen tasks, although the explicit feedback method is more robust there.

In summary, LoHoRavens fills a gap in benchmarking long-horizon language-conditioned manipulation and provides a tool and baselines for future research.
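The explicit-feedback idea summarized under Solutions (plan a step, execute it, describe the resulting scene in text, then replan) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the paper's implementation: `closed_loop_plan`, `ToyTable`, and `toy_planner` are hypothetical names, the planner is a hand-written rule standing in for an LLM, and the caption is a plain string standing in for a vision-language captioner. The first action is made to fail on purpose so the effect of the feedback is visible.

```python
from typing import Callable, List

DONE = "done"


def closed_loop_plan(
    instruction: str,
    planner: Callable[[str, List[str]], str],  # hypothetical stand-in for the LLM
    actor: Callable[[str], None],              # hypothetical low-level policy
    reporter: Callable[[], str],               # hypothetical scene captioner
    max_steps: int = 10,
) -> List[str]:
    """Alternate planning, acting, and captioning until the planner returns DONE."""
    # Start from an initial caption so the planner sees the scene before acting.
    history = [f"observation: {reporter()}"]
    for _ in range(max_steps):
        step = planner(instruction, history)
        if step == DONE:
            break
        actor(step)
        # Explicit feedback: the new scene is described in text and appended.
        history.append(f"step: {step} | observation: {reporter()}")
    return history


class ToyTable:
    """Toy world (assumption): blocks on a table that should end up in a bowl."""

    def __init__(self) -> None:
        self.on_table = ["red block", "blue block"]
        self.in_bowl: List[str] = []
        self.attempts = 0

    def act(self, step: str) -> None:
        self.attempts += 1
        if self.attempts == 1:
            return  # simulate a failed grasp on the very first attempt
        obj = step[len("put "):-len(" in bowl")]
        if obj in self.on_table:
            self.on_table.remove(obj)
            self.in_bowl.append(obj)

    def caption(self) -> str:
        # Text description of what is still on the table.
        return ", ".join(self.on_table)


def toy_planner(instruction: str, history: List[str]) -> str:
    """Rule-based stand-in for an LLM: choose the next step from the latest caption."""
    latest = history[-1].split("observation: ")[1]
    remaining = [name for name in latest.split(", ") if name]
    return f"put {remaining[0]} in bowl" if remaining else DONE


table = ToyTable()
history = closed_loop_plan(
    "put all blocks in the bowl", toy_planner, table.act, table.caption
)
for line in history:
    print(line)
```

Because the planner reads only the latest caption, the failed first grasp shows up as an unchanged observation and the same step is proposed again. An open-loop plan fixed in advance would have moved on to the blue block and left the red one on the table. In the implicit-feedback variant described above, the text caption would be replaced by visual embeddings projected into the LLM's token-embedding space.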