Abstract: Foundation models trained on internet-scale data, such as Vision-Language Models (VLMs), excel at performing tasks involving common sense, such as visual question answering. Despite their impressive capabilities, these models cannot currently be directly applied to challenging robot manipulation problems that require complex and precise continuous reasoning. Task and Motion Planning (TAMP) systems can control high-dimensional continuous systems over long horizons by combining traditional primitive robot operations. However, these systems require detailed models of how the robot can impact its environment, preventing them from directly interpreting and addressing novel human objectives, for example, an arbitrary natural language goal. We propose deploying VLMs within TAMP systems by having them generate discrete and continuous language-parameterized constraints that enable TAMP to reason about open-world concepts. Specifically, we propose algorithms for VLM partial planning that constrain a TAMP system's discrete temporal search and VLM continuous constraints interpretation to augment the traditional manipulation constraints that TAMP systems seek to satisfy. We demonstrate our approach on two robot embodiments, including a real-world robot, across several manipulation tasks, where the desired objectives are conveyed solely through language.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses open-world robotic manipulation. Specifically, the authors combine the capabilities of Vision-Language Models (VLMs) and Task and Motion Planning (TAMP) systems so that robots can understand and execute complex, multi-step manipulation tasks described in natural language.

### Background and Challenges

1. **Advantages of Vision-Language Models (VLMs)**:
   - VLMs excel in tasks involving common sense, such as visual question answering.
   - However, VLMs cannot be directly applied to robotic manipulation tasks that require complex continuous reasoning because they cannot output continuous values (e.g., joint angles, grasp positions, etc.).
2. **Advantages of Task and Motion Planning (TAMP) Systems**:
   - TAMP systems can control high-dimensional continuous systems over long horizons by combining traditional primitive robot operations.
   - However, TAMP systems require detailed models of how the robot can impact its environment, which limits their ability to handle novel human objectives, such as arbitrary natural language goals.

### Solution

The authors propose a method called OWL-TAMP (Open-World Language-based TAMP) that integrates VLMs into TAMP systems in the following ways:

1. **Generating Discrete Constraints**:
   - Using VLMs to generate discrete language-parameterized constraints (partial plans) that constrain the discrete temporal search of the TAMP system.
2. **Generating Continuous Constraints**:
   - Using VLMs to generate continuous constraint functions that interpret open-world concepts and augment the traditional manipulation constraints the TAMP system seeks to satisfy.
3. **Iterative Optimization of Constraints**:
   - Refining the generated constraint functions by iteratively re-prompting the VLM so that these constraints accurately reflect the task requirements.

### Experimental Results

1. **Simulated Environment Experiments**:
   - Experiments were conducted on three tasks in the RAVENS-YCB manipulation environment: strawberry placement, cup flipping, and plate setting.
   - The results show that the OWL-TAMP method achieved the highest success rate in these three tasks, particularly excelling in the "strawberry placement" and "cup flipping" tasks.
2. **Real Robot Experiments**:
   - The method was implemented on a custom dual-arm manipulator and successfully completed 10 different real-world manipulation tasks.

### Conclusion

By combining the capabilities of VLMs and TAMP systems, this paper enables open-world robotic manipulation from language goals. Experimental results indicate that the OWL-TAMP method has a high success rate in handling complex tasks described in natural language, demonstrating its potential in practical applications.
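To make the "continuous constraints" idea from the Solution section more concrete, here is a minimal Python sketch. It is not the OWL-TAMP implementation: the names `Pose2D`, `Region`, `vlm_constraint_in_region`, `reachable`, and `sample_placement` are all hypothetical, the VLM-generated constraint is replaced by a hand-written stand-in, and simple rejection sampling is assumed in place of whatever sampler a real TAMP system would use. It only illustrates how a language-derived predicate could sit alongside a traditional manipulation constraint when searching for a placement pose.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Pose2D:
    """Hypothetical planar placement pose (metres)."""
    x: float
    y: float


@dataclass(frozen=True)
class Region:
    """Hypothetical axis-aligned rectangle on the table."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def contains(self, pose: Pose2D) -> bool:
        return (self.x_min <= pose.x <= self.x_max
                and self.y_min <= pose.y <= self.y_max)


Constraint = Callable[[Pose2D], bool]


def vlm_constraint_in_region(region: Region) -> Constraint:
    """Stand-in for a VLM-generated constraint function.

    Assumption: for a goal like "put the strawberry in the bowl", the
    VLM would emit a predicate over poses; here it is written by hand.
    """
    return region.contains


def reachable(pose: Pose2D, max_reach: float = 0.9) -> bool:
    """Stand-in for a traditional constraint: the pose must lie within
    max_reach of a robot base assumed to sit at the origin."""
    return math.hypot(pose.x, pose.y) <= max_reach


def sample_placement(constraints: Sequence[Constraint],
                     workspace: Region,
                     rng: random.Random,
                     max_tries: int = 1000) -> Optional[Pose2D]:
    """Rejection-sample a pose in the workspace satisfying every constraint.

    Returns None if no satisfying pose is found within max_tries.
    """
    for _ in range(max_tries):
        pose = Pose2D(rng.uniform(workspace.x_min, workspace.x_max),
                      rng.uniform(workspace.y_min, workspace.y_max))
        if all(check(pose) for check in constraints):
            return pose
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    workspace = Region(0.0, 1.0, 0.0, 1.0)
    bowl = Region(0.4, 0.6, 0.4, 0.6)
    pose = sample_placement([vlm_constraint_in_region(bowl), reachable],
                            workspace, rng)
    print("placement:", pose)
```

In this sketch the language-derived predicate and the reachability check are treated uniformly as boolean tests on a candidate pose. A `None` result could then serve as the kind of failure signal that triggers re-prompting the VLM for a revised constraint, in the spirit of the iterative refinement step described above.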

Open-World Task and Motion Planning via Vision-Language Model Inferred Constraints
