Abstract: Robotics tasks are highly compositional by nature. For example, to perform a high-level task like cleaning the table a robot must employ low-level capabilities of moving the effectors to the objects on the table, pick them up and then move them off the table one-by-one, while re-evaluating the consequently dynamic scenario in the process. Given that large vision language models (VLMs) have shown progress on many tasks that require high level, human-like reasoning, we ask the question: if the models are taught the requisite low-level capabilities, can they compose them in novel ways to achieve interesting high-level tasks like cleaning the table without having to be explicitly taught so? To this end, we present ClevrSkills - a benchmark suite for compositional reasoning in robotics. ClevrSkills is an environment suite developed on top of the ManiSkill2 simulator and an accompanying dataset. The dataset contains trajectories generated on a range of robotics tasks with language and visual annotations as well as multi-modal prompts as task specification. The suite includes a curriculum of tasks with three levels of compositional understanding, starting with simple tasks requiring basic motor skills. We benchmark multiple different VLM baselines on ClevrSkills and show that even after being pre-trained on large numbers of tasks, these models fail on compositional reasoning in robotics tasks.

What problem does this paper attempt to address?

The core problem this paper addresses is: **Can robots learn low-level manipulation skills and combine them to complete complex high-level tasks without being explicitly taught to do so?** The paper explores this question from the following angles:

1. **Background and Motivation**:
   - Robotics tasks are highly compositional in nature. For example, to complete a high-level task such as cleaning a table, a robot needs low-level abilities such as moving its effectors to the objects on the table, picking them up, and removing them one by one.
   - Large vision-language models (VLMs) have made progress on many tasks that require high-level, human-like reasoning. The authors therefore ask: if these models are taught the necessary low-level abilities, can they compose them in novel ways to complete interesting high-level tasks (such as cleaning a table) without explicit instruction?
2. **Proposed Method**:
   - The authors introduce ClevrSkills, a benchmark suite for compositional reasoning in robotics. ClevrSkills is built on top of the ManiSkill2 simulator and comes with a dataset of trajectories across a range of tasks. The trajectories include language and visual annotations, with multi-modal prompts serving as task specifications.
   - ClevrSkills organizes its tasks into three levels of compositional understanding, starting from simple tasks that require basic motor skills and increasing in complexity.
3. **Evaluation and Experiments**:
   - The authors benchmark multiple VLM baselines on ClevrSkills. The results show that even after pre-training on a large number of tasks, these models still fail at compositional reasoning in robotics tasks.
   - Across the three task levels (L0, L1, L2), the authors systematically test the models' ability to compose simple motor skills into more complex tasks, in both zero-shot and fine-tuned settings.
4. **Main Contributions**:
   - The ClevrSkills environment suite, which contains 33 different tasks distributed across three difficulty levels, for evaluating the compositional reasoning ability of robotics models.
   - A dataset of 330,000 trajectories generated by scripted oracle policies, which can be used for imitation learning.
   - A benchmark of existing state-of-the-art vision-language models that demonstrates their limitations on compositional understanding tasks.

In conclusion, the paper evaluates the compositional reasoning ability of large vision-language models in robotics tasks, in particular whether they can complete high-level tasks by composing low-level skills without explicit guidance.
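To make the notion of composition concrete, here is a minimal toy sketch of the "clean the table" example: three low-level skills are chained into one high-level task, and the scene is re-checked after every object is removed. It is only an illustration under stated assumptions. `TableWorld`, `move_to`, `pick`, `place_off_table`, and `clear_table` are hypothetical names invented for this sketch; they are not part of ClevrSkills or ManiSkill2, and no simulator is involved.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TableWorld:
    """Toy stand-in for a scene (hypothetical; not the ClevrSkills API)."""
    on_table: List[str] = field(default_factory=list)
    off_table: List[str] = field(default_factory=list)
    held: Optional[str] = None
    log: List[str] = field(default_factory=list)


# --- Low-level skills (hypothetical) ---------------------------------------
def move_to(world: TableWorld, obj: str) -> None:
    """Move the effector to an object; here we only record the action."""
    world.log.append(f"move_to({obj})")


def pick(world: TableWorld, obj: str) -> None:
    """Grasp an object that is currently on the table."""
    world.on_table.remove(obj)
    world.held = obj
    world.log.append(f"pick({obj})")


def place_off_table(world: TableWorld) -> None:
    """Put the held object down somewhere off the table."""
    obj = world.held
    world.off_table.append(obj)
    world.held = None
    world.log.append(f"place_off_table({obj})")


# --- High-level task composed from the skills above ------------------------
def clear_table(world: TableWorld) -> TableWorld:
    # Re-evaluate the scene after each object instead of planning once.
    while world.on_table:
        target = world.on_table[0]
        move_to(world, target)
        pick(world, target)
        place_off_table(world)
    return world


world = clear_table(TableWorld(on_table=["cup", "plate"]))
print(world.off_table)  # ['cup', 'plate']
print(world.log)
```

In this sketch the composition is hand-written. The question the benchmark poses is whether a model that has learned the individual skills can arrive at such a composition on its own.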

ClevrSkills: Compositional Language and Visual Reasoning in Robotics

GSC: A Graph-Based Skill Composition Framework for Robot Learning

Grounding Language for Robotic Manipulation via Skill Library

A Benchmark for Compositional Visual Reasoning

ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks

CompoSuite: A Compositional Reinforcement Learning Benchmark

Closed Loop Interactive Embodied Reasoning for Robot Manipulation

Example-Driven Model-Based Reinforcement Learning for Solving Long-Horizon Visuomotor Tasks

CLEVR-POC: Reasoning-Intensive Visual Question Answering in Partially Observable Environments

Grounding Language with Visual Affordances over Unstructured Data

Abstract Visual Reasoning Enabled by Language

Probing Mechanical Reasoning in Large Vision Language Models

Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition

Multi-Level Compositional Reasoning for Interactive Instruction Following

Visuospatial Skill Learning for Robots

Learning Manipulation Skills through Robot Chain-of-Thought with Sparse Failure Guidance

RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

Development of Compositionality and Generalization through Interactive Learning of Language and Action of Robots

Benchmarking Adaptive Intelligence and Computer Vision on Human-Robot Collaboration

Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers