Abstract: We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation: interacting with unseen objects in novel scenes without test-time adaptation. While typical approaches rely on a large amount of demonstration data for such generalization, we propose an approach that leverages web videos to predict plausible interaction plans and learns a task-agnostic transformation to obtain robot actions in the real world. Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal, and can be trained with diverse videos on the web, including those of humans and robots manipulating everyday objects. We use these 2D track predictions to infer a sequence of rigid transforms of the object to be manipulated, and obtain robot end-effector poses that can be executed in an open-loop manner. We then refine this open-loop plan by predicting residual actions through a closed-loop policy trained with a few embodiment-specific demonstrations. We show that this approach of combining scalably learned track prediction with a residual policy requiring minimal in-domain robot-specific data enables diverse generalizable robot manipulation, and present a wide array of real-world robot manipulation results across unseen tasks, objects, and scenes. https://homangab.github.io/track2act/

What problem does this paper attempt to address?

The paper addresses how to build a general-purpose robot manipulation system that can be deployed directly in new scenarios, without additional training or self-practice at test time. Specifically, the authors aim for a robot that can perform a variety of everyday tasks, including interacting with unseen objects in new environments. Traditional approaches typically rely on a large amount of demonstration data to generalize this way, but such data is difficult and costly to collect. The paper therefore proposes a method that uses internet videos to predict plausible interaction plans, and combines this with a small amount of embodiment-specific robot data to learn a task-agnostic transformation that yields real-world robot actions.

### Main Contributions

1. **Predicting Interaction Plans**: A framework that predicts embodiment-agnostic interaction plans, in the form of point trajectories, from diverse internet videos.
2. **Direct Manipulation**: A procedure for using the interaction plan prediction model to obtain 3D rigid body transformations in the robot's environment, enabling manipulation without any robot data or online exploration.
3. **Residual Policy Correction**: A goal-conditioned residual policy, learned from a small number of embodiment-specific task demonstrations (about 400 trajectories), that corrects errors at each step of the predicted plan and enables closed-loop deployment.

### Method Overview

- **Point Trajectory Prediction**: Given an initial image, a goal image, and an initial set of points, predict the future positions of the points. This step uses a diffusion model with a DiT architecture.
- **Coarse Manipulation Trajectory Inference**: From the predicted point trajectories, infer the sequence of rigid body transformations of the object, and from these generate an open-loop action trajectory for the robot's end-effector.
- **Closed-Loop Manipulation**: Learn a residual policy that corrects each step of the open-loop actions to improve the accuracy and robustness of the manipulation.

### Experimental Setup

- **Experimental Scenarios**: Manipulation experiments were conducted with a Boston Dynamics Spot robot in various living rooms, offices, and kitchens.
- **Evaluation Metrics**: Success rates are reported at several levels of generalization: mild generalization (MG), standard generalization (G), compositional generalization (CG), and type generalization (TG).

### Results

- **Qualitative Results**: Point trajectory predictions on different datasets illustrate the model's generalization capability and the plausibility of its predictions.
- **Quantitative Results**: Comparisons with several baseline methods show that the proposed method performs better in both point trajectory prediction and robotic manipulation tasks.

Overall, the proposed method generalizes broadly to new tasks and new scenarios without relying on a large amount of robot-specific data, which makes it of practical value.
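The "Coarse Manipulation Trajectory Inference" step in the Method Overview turns predicted point tracks into a sequence of rigid body transformations. The sketch below illustrates one common way to fit a rigid transform to point correspondences, the Kabsch (orthogonal Procrustes) algorithm. It is a minimal illustration under stated assumptions, not the authors' procedure: it assumes the tracked points have already been lifted to 3D (for example with a depth camera), and the names `estimate_rigid_transform` and `transforms_along_track` are hypothetical. The paper's own pipeline starts from 2D tracks and may solve for the transforms differently.

```python
import numpy as np


def estimate_rigid_transform(src, dst):
    """Least-squares rigid fit (Kabsch): find R, t with dst ~= src @ R.T + t.

    src, dst: arrays of shape (N, 3) holding corresponding 3D points.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def transforms_along_track(points_per_step):
    """points_per_step: array of shape (T, N, 3), one point set per time-step.

    Returns a list of (R, t) pairs mapping the first step to each later step.
    """
    first = points_per_step[0]
    return [estimate_rigid_transform(first, p) for p in points_per_step[1:]]


# Synthetic check: rotate by 30 degrees about z, translate, then recover.
rng = np.random.default_rng(0)
points0 = rng.normal(size=(20, 3))
angle = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([0.1, -0.2, 0.05])
points1 = points0 @ R_true.T + t_true

R_est, t_est = estimate_rigid_transform(points0, points1)
track_transforms = transforms_along_track(np.stack([points0, points1]))
```

With noise-free correspondences the fit recovers the transform up to floating-point error. Predicted tracks are noisy in practice, so a robust wrapper such as RANSAC around this fit would be a natural extension. In the paper's framework the resulting open-loop poses are then refined by the learned residual policy.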

Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

Multi-modal 3D Human Tracking for Robots in Complex Environment with Siamese Point-Video Transformer

Beyond Traditional Driving Scenes: A Robotic-Centric Paradigm for 2D+3D Human Tracking Using Siamese Transformer Network

Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans

Concept2Robot: Learning Manipulation Concepts from Instructions and Human Demonstrations

Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations

Learning dexterity from human hand motion in internet videos

Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations

Robotic Telekinesis: Learning a Robotic Hand Imitator by Watching Humans on Youtube

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

An Object Attribute Guided Framework for Robot Learning Manipulations from Human Demonstration Videos

Robobarista: Object Part based Transfer of Manipulation Trajectories from Crowd-sourcing in 3D Pointclouds

Robot Trajectron: Trajectory Prediction-based Shared Control for Robot Manipulation

Manipulate-Anything: Automating Real-World Robots using Vision-Language Models

Learning Manipulation by Predicting Interaction

Learning Generalizable 3D Manipulation With 10 Demonstrations

Learning Robotic Manipulation through Visual Planning and Acting

A Human–Robot Collaboration Method Using a Pose Estimation Network for Robot Learning of Assembly Manipulation Trajectories From Demonstration Videos

Scaling Manipulation Learning with Visual Kinematic Chain Prediction