Manipulate-Anything: Automating Real-World Robots using Vision-Language Models

Jiafei Duan, Wentao Yuan, Wilbert Pumacay, Yi Ru Wang, Kiana Ehsani, Dieter Fox, Ranjay Krishna

2024-08-30

Abstract: Large-scale endeavors like RT-1 and widespread community efforts such as Open-X-Embodiment have contributed to growing the scale of robot demonstration data. However, there is still an opportunity to improve the quality, quantity, and diversity of robot demonstration data. Although vision-language models have been shown to automatically generate demonstration data, their utility has been limited to environments with privileged state information, they require hand-designed skills, and they are limited to interactions with few object instances. We propose Manipulate-Anything, a scalable automated generation method for real-world robotic manipulation. Unlike prior work, our method can operate in real-world environments without any privileged state information or hand-designed skills, and can manipulate any static object. We evaluate our method using two setups. First, Manipulate-Anything successfully generates trajectories for all 7 real-world and 14 simulation tasks, significantly outperforming existing methods like VoxPoser. Second, Manipulate-Anything's demonstrations can train more robust behavior cloning policies than training with human demonstrations, or from data generated by VoxPoser, Scaling-up, and Code-As-Policies. We believe Manipulate-Anything can be a scalable method for both generating data for robotics and solving novel tasks in a zero-shot setting. Project page: https://robot-ma.github.io/

Robotics, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper addresses the insufficient quality, quantity, and diversity of real-world robot manipulation data. Although large-scale data collection efforts such as RT-1 and Open-X-Embodiment have increased the scale of available data, existing automated data-generation methods still have the following limitations:

1. **Require privileged state information**: Many existing methods rely on privileged state information provided in simulated environments, which is difficult to obtain in the real world.
2. **Require manually designed skills**: Many methods require a pre-designed skill library, which limits the flexibility and adaptability of robots.
3. **Handle only a few specific objects**: Existing methods can usually handle only a limited number of object instances with known geometries.

To overcome these limitations, the paper proposes MANIPULATE-ANYTHING, a method that automatically generates high-quality robot manipulation data and can manipulate any static object without privileged state information. Specifically, MANIPULATE-ANYTHING has the following features:

- **No privileged state information required**: It can operate in real-world environments without additional state information.
- **No manually designed skills required**: It can automatically generate the sub-tasks and actions needed for the task.
- **Manipulates any static object**: It can handle a diverse range of objects, not just a few specific instances.

With these improvements, MANIPULATE-ANYTHING can both generate high-quality data and solve new tasks in a zero-shot setting. The generated data can be used to train behavior cloning policies that are more robust than those trained on human demonstrations or on data generated by VoxPoser, Scaling-up, and Code-As-Policies.

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations

Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations

Open-World Object Manipulation using Pre-trained Vision-Language Models

Watch and Act: Learning Robotic Manipulation from Visual Demonstration.

Vision-based Robot Manipulation Learning via Human Demonstrations

Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations

DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Vision-based Manipulation from Single Human Video with Open-World Object Graphs

VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

Multiple Interactions Made Easy (MIME): Large Scale Demonstrations Data for Imitation

An Object Attribute Guided Framework for Robot Learning Manipulations from Human Demonstration Videos

MoDem-V2: Visuo-Motor World Models for Real-World Robot Manipulation

VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation

Learning Robotic Manipulation through Visual Planning and Acting

Concept2Robot: Learning Manipulation Concepts from Instructions and Human Demonstrations

Design of Demonstration-Driven Assembling Manipulator

TeleMoMa: A Modular and Versatile Teleoperation System for Mobile Manipulation

Learning Manipulation by Predicting Interaction