Abstract: Humans can learn to manipulate new objects by simply watching others; providing robots with the ability to learn from such demonstrations would enable a natural interface for specifying new behaviors. This work develops Robot See Robot Do (RSRD), a method for imitating articulated object manipulation from a single monocular RGB human demonstration given a single static multi-view object scan. We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video with differentiable rendering. This analysis-by-synthesis approach uses part-centric feature fields in an iterative optimization, which enables the use of geometric regularizers to recover 3D motions from only a single video. Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion. By representing demonstrations as part-centric trajectories, RSRD focuses on replicating the demonstration's intended behavior while considering the robot's own morphological limits, rather than attempting to reproduce the hand's motion. We evaluate 4D-DPM's 3D tracking accuracy on ground-truth annotated 3D part trajectories and RSRD's physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot. Each phase of RSRD achieves an average success rate of 87%, for a total end-to-end success rate of 60% across 90 trials. Notably, this is accomplished using only feature fields distilled from large pretrained vision models, without any task-specific training, fine-tuning, dataset collection, or annotation. Project page: https://robot-see-robot-do.github.io

Learning Multi-Step Manipulation Tasks from A Single Human Demonstration

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations

Learning Generalizable 3D Manipulation With 10 Demonstrations

Vision-based Robot Manipulation Learning via Human Demonstrations

An Object Attribute Guided Framework for Robot Learning Manipulations from Human Demonstration Videos

Vision-Based Multi-Task Manipulation for Inexpensive Robots Using End-To-End Learning from Demonstration

Learning Multimodal Contact-Rich Skills from Demonstrations Without Reward Engineering

DITTO: Demonstration Imitation by Trajectory Transformation

From One Hand to Multiple Hands: Imitation Learning for Dexterous Manipulation From Single-Camera Teleoperation

DexMimicGen: Automated Data Generation for Bimanual Dexterous Manipulation via Imitation Learning

DexMV: Imitation Learning for Dexterous Manipulation from Human Videos

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction

Learning to Manipulate Tools by Aligning Simulation to Video Demonstration

A Human–Robot Collaboration Method Using a Pose Estimation Network for Robot Learning of Assembly Manipulation Trajectories From Demonstration Videos

Human Demonstrations are Generalizable Knowledge for Robots

Learning Cooperative Dynamic Manipulation Skills from Human Demonstration Videos

Efficient Robot Skill Learning with Imitation from a Single Video for Contact-Rich Fabric Manipulation

Manipulate-Anything: Automating Real-World Robots using Vision-Language Models

Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos