Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, Sergey Levine

2023-10-17

Abstract: If generalist robots are to operate in truly unstructured environments, they need to be able to recognize and reason about novel objects and scenarios. Such objects and scenarios might not be present in the robot's own training data. We propose SuSIE, a method that leverages an image-editing diffusion model to act as a high-level planner by proposing intermediate subgoals that a low-level controller can accomplish. Specifically, we finetune InstructPix2Pix on video data, consisting of both human videos and robot rollouts, such that it outputs hypothetical future "subgoal" observations given the robot's current observation and a language command. We also use the robot data to train a low-level goal-conditioned policy to act as the aforementioned low-level controller. We find that the high-level subgoal predictions can utilize Internet-scale pretraining and visual understanding to guide the low-level goal-conditioned policy, achieving significantly better generalization and precision than conventional language-conditioned policies. We achieve state-of-the-art results on the CALVIN benchmark, and also demonstrate robust generalization on real-world manipulation tasks, beating strong baselines that have access to privileged information or that utilize orders of magnitude more compute and training data. The project website can be found at http://rail-berkeley.github.io/susie.

Robotics

What problem does this paper attempt to address?

The paper addresses how robots can recognize and handle novel objects and scenes in truly unstructured environments, including ones that are absent from the robot's own training data. It proposes SuSIE, a method that uses a pretrained image-editing diffusion model as a high-level planner: the model generates intermediate subgoals, and a low-level controller carries them out. This lets the system draw on Internet-scale pretraining and visual understanding to guide a low-level goal-conditioned policy, yielding better generalization and precision than conventional language-conditioned policies. The paper reports state-of-the-art results on the CALVIN benchmark and robust generalization on real-world manipulation tasks, outperforming strong baselines that have access to privileged information or that use orders of magnitude more compute and training data.
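The two-level structure described above can be illustrated with a toy control loop. This is only a sketch under loud assumptions, not the authors' implementation: the observation is a single number instead of an image, and `propose_subgoal`, `low_level_action`, `run_episode`, and the step counts are all hypothetical names and values invented for illustration. In the paper, the planner is a finetuned InstructPix2Pix model and the controller is a learned goal-conditioned policy.

```python
from typing import List

# ASSUMPTION: a 1-D float stands in for the robot's image observation.
Observation = float


def propose_subgoal(obs: Observation, command: str) -> Observation:
    """Toy high-level planner (hypothetical stand-in for the diffusion model).

    Given the current observation and a language command, return a nearby
    "future observation" that moves toward what the command asks for.
    """
    target = {"move right": 10.0, "move left": -10.0}[command]
    step = max(-2.0, min(2.0, target - obs))  # subgoals stay close to obs
    return obs + step


def low_level_action(obs: Observation, goal: Observation) -> float:
    """Toy goal-conditioned policy: a bounded step toward the subgoal.

    Note that it never sees the language command, only the subgoal.
    """
    return max(-0.5, min(0.5, goal - obs))


def run_episode(obs: Observation, command: str,
                num_subgoals: int = 5,
                steps_per_subgoal: int = 4) -> List[Observation]:
    """Alternate between proposing a subgoal and acting to reach it."""
    trajectory = [obs]
    for _ in range(num_subgoals):
        subgoal = propose_subgoal(obs, command)    # high-level planner
        for _ in range(steps_per_subgoal):
            obs = obs + low_level_action(obs, subgoal)  # low-level control
            trajectory.append(obs)
    return trajectory


if __name__ == "__main__":
    print(run_episode(0.0, "move right")[-1])
```

The point of the sketch is the division of labor: only the planner interprets the language command, while the controller only has to reach a nearby subgoal, which is the part trained on robot data alone.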

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans

Scaling Robot Learning with Semantically Imagined Experience

Incorporating Task Progress Knowledge for Subgoal Generation in Robotic Manipulation through Image Edits

Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations

SAGCI-System: Towards Sample-Efficient, Generalizable, Compositional, and Incremental Robot Learning

Efficient Robot Skill Learning with Imitation from a Single Video for Contact-Rich Fabric Manipulation

Concept2Robot: Learning Manipulation Concepts from Instructions and Human Demonstrations

Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies

Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations

Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

Learning Robotic Manipulation through Visual Planning and Acting

Manipulate-Anything: Automating Real-World Robots using Vision-Language Models

Learning Generalizable 3D Manipulation With 10 Demonstrations

Learning Visual Robotic Control Efficiently with Contrastive Pre-training and Data Augmentation

Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction

Local Policies Enable Zero-shot Long-horizon Manipulation

Open-World Object Manipulation using Pre-trained Vision-Language Models

Precise and Dexterous Robotic Manipulation via Human-in-the-Loop Reinforcement Learning

Mirage: Cross-Embodiment Zero-Shot Policy Transfer with Cross-Painting