This&That: Language-Gesture Controlled Video Generation for Robot Planning

Boyang Wang, Nikhil Sridhar, Chao Feng, Mark Van der Merwe, Adam Fishman, Nima Fazeli, Jeong Joon Park

2024-07-08

Abstract: We propose a robot learning method for communicating, planning, and executing a wide range of tasks, dubbed This&That. We achieve robot planning for general tasks by leveraging the power of video generative models trained on internet-scale data containing rich physical and semantic context. In this work, we tackle three fundamental challenges in video-based planning: 1) unambiguous task communication with simple human instructions, 2) controllable video generation that respects user intents, and 3) translating visual planning into robot actions. We propose language-gesture conditioning to generate videos, which is both simpler and clearer than existing language-only methods, especially in complex and uncertain environments. We then suggest a behavioral cloning design that seamlessly incorporates the video plans. This&That demonstrates state-of-the-art effectiveness in addressing the above three challenges, and justifies the use of video generation as an intermediate representation for generalizable task planning and execution. Project website: https://cfeng16.github.io/this-and-that/

Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses three core challenges in video-based robot planning: (1) communicating a task unambiguously through simple human instructions; (2) generating videos that can be controlled to respect user intent; and (3) translating the resulting visual plan into robot actions. The authors propose a method called This&That, which leverages video generative models trained on internet-scale data to plan tasks in complex and uncertain environments. Specifically, the method conditions video generation on both language and gestures, which the authors describe as simpler and clearer than existing language-only methods. The paper also proposes a behavioral cloning design that incorporates the generated video plans when producing robot actions. The authors report that This&That achieves state-of-the-art effectiveness on all three challenges and argue that this justifies video generation as an intermediate representation for generalizable task planning and execution.
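To make the idea of language-gesture conditioning concrete, here is a minimal sketch of how a short instruction and a set of pointing gestures might be packaged as conditioning input. The summary above does not say how gestures are represented, so everything here is an assumption for illustration only: the `GestureCondition` container, the `rasterize_gestures` helper, the choice of 2D pixel coordinates, and the square binary masks are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GestureCondition:
    """Hypothetical container: a short instruction plus 2D pointing locations."""
    text: str
    points: List[Tuple[int, int]]  # (row, col) pixel coordinates, one per gesture


def rasterize_gestures(cond: GestureCondition, height: int, width: int,
                       radius: int = 1) -> List[List[List[int]]]:
    """Return one binary height x width mask per gesture point.

    Each mask marks a small square around the pointed-at pixel. This is an
    assumed encoding, chosen only because it is easy to inspect.
    """
    masks = []
    for r0, c0 in cond.points:
        if not (0 <= r0 < height and 0 <= c0 < width):
            raise ValueError(f"gesture point {(r0, c0)} lies outside the frame")
        mask = [[0] * width for _ in range(height)]
        # Clip the square to the frame so points near the border stay valid.
        for r in range(max(0, r0 - radius), min(height, r0 + radius + 1)):
            for c in range(max(0, c0 - radius), min(width, c0 + radius + 1)):
                mask[r][c] = 1
        masks.append(mask)
    return masks


# Example: "this" points at an object, "that" points at a target location.
cond = GestureCondition(text="put this there", points=[(2, 2), (0, 5)])
masks = rasterize_gestures(cond, height=6, width=8, radius=1)
```

In this sketch the text stays deliberately vague ("put this there") and the two masks carry the spatial information, which mirrors the division of labor the abstract describes between simple language and gestures. How such inputs would actually be fed to a video model is outside what the summary specifies.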

Video Language Planning

VideoAgent: Self-Improving Video Generation

Learning Robotic Manipulation through Visual Planning and Acting

See, Plan, Predict: Language-guided Cognitive Planning with Video Prediction

Lifelong Robot Learning with Human Assisted Language Planners

Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning

Generative Expressive Robot Behaviors using Large Language Models

$π_0$: A Vision-Language-Action Flow Model for General Robot Control

Interactive Task Planning with Language Models

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Learning Neuro-symbolic Programs for Language Guided Robot Manipulation

Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach

Multi-Modal Grounded Planning and Efficient Replanning For Learning Embodied Agents with A Few Examples

VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model

RoboDreamer: Learning Compositional World Models for Robot Imagination

GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration

Vision-Language Interpreter for Robot Task Planning

Visual Robot Task Planning

Dreamitate: Real-World Visuomotor Policy Learning via Video Generation