Abstract: While imitation learning (IL) offers a promising framework for teaching robots various behaviors, learning complex tasks remains challenging. Existing IL policies struggle to generalize effectively across visual and spatial variations, even for simple tasks. In this work, we introduce SPHINX: Salient Point-based Hybrid ImitatioN and eXecution, a flexible IL policy that leverages multimodal observations (point clouds and wrist images), along with a hybrid action space of low-frequency, sparse waypoints and high-frequency, dense end-effector movements. Given 3D point cloud observations, SPHINX learns to infer task-relevant points within a point cloud, or salient points, which support spatial generalization by focusing on semantically meaningful features. These salient points serve as anchor points to predict waypoints for long-range movement, such as reaching target poses in free space. Once near a salient point, SPHINX learns to switch to predicting dense end-effector movements given close-up wrist images for precise phases of a task. By exploiting the strengths of different input modalities and action representations for different manipulation phases, SPHINX tackles complex tasks in a sample-efficient, generalizable manner. Our method achieves 86.7% success across 4 real-world and 2 simulated tasks, outperforming the next best state-of-the-art IL baseline by 41.1% on average across 440 real-world trials. SPHINX additionally generalizes to novel viewpoints, visual distractors, spatial arrangements, and execution speeds with a 1.7x speedup over the most competitive baseline. Our website (http://sphinx-manip.github.io) provides open-sourced code for data collection, training, and evaluation, along with supplementary videos.
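The hybrid action space described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `hybrid_step`, `select_salient_point`, and `switch_radius` are hypothetical, the per-point saliency scores, waypoint offset, and dense delta are assumed to come from learned networks that are not shown, and a fixed distance threshold stands in for the mode switch that SPHINX learns.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class HybridAction:
    mode: str      # "waypoint" (sparse, long-range) or "dense" (fine-grained)
    target: Point  # absolute waypoint position, or a small end-effector delta


def select_salient_point(points: Sequence[Point], scores: Sequence[float]) -> Point:
    """Return the point with the highest (assumed, externally predicted) saliency score."""
    best = max(range(len(points)), key=lambda i: scores[i])
    return points[best]


def hybrid_step(points, scores, offset, ee_pos, dense_delta, switch_radius=0.05):
    """Choose between a waypoint action and a dense action.

    Assumption: a distance threshold replaces the learned mode prediction.
    """
    salient = select_salient_point(points, scores)
    # The waypoint is anchored to the salient point via a predicted offset.
    waypoint = tuple(s + o for s, o in zip(salient, offset))
    if math.dist(ee_pos, waypoint) > switch_radius:
        return HybridAction("waypoint", waypoint)
    # Close to the anchor: hand over to dense end-effector control.
    return HybridAction("dense", dense_delta)


# Toy inputs (made up for illustration only).
points = [(0.0, 0.0, 0.0), (0.5, 0.2, 0.1), (0.9, 0.9, 0.9)]
scores = [0.1, 0.8, 0.1]
offset = (0.0, 0.0, 0.1)
delta = (0.0, 0.0, -0.01)

far = hybrid_step(points, scores, offset, (0.0, 0.0, 0.5), delta)
near = hybrid_step(points, scores, offset, (0.5, 0.2, 0.21), delta)
print(far.mode, near.mode)
```

In this toy setup the end effector starts far from the anchored waypoint and receives a waypoint action, then receives a dense action once it is within `switch_radius` of that waypoint.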

RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective

3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations

Efficient Robot Skill Learning with Imitation from a Single Video for Contact-Rich Fabric Manipulation

Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction

Part-Guided 3D RL for Sim2Real Articulated Object Manipulation

Learning from demonstrations: An intuitive VR environment for imitation learning of construction robots

Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination

Spatial-Language Attention Policies for Efficient Robot Learning

SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation

Reconciling Reality through Simulation: A Real-to-Sim-to-Real Approach for Robust Manipulation

RiEMann: Near Real-Time SE(3)-Equivariant Robot Manipulation without Point Cloud Segmentation

3D-RPP: a Novel 3D Vision-Based Pose Perception Approach for Industrial Robots

What's the Move? Hybrid Imitation Learning via Salient Points

VIRT: Vision Instructed Transformer for Robotic Manipulation

Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies

Dreamitate: Real-World Visuomotor Policy Learning via Video Generation

Depth Helps: Improving Pre-trained RGB-based Policy with Depth Information Injection

Generalizable Humanoid Manipulation with Improved 3D Diffusion Policies

Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation

SPIRE: Synergistic Planning, Imitation, and Reinforcement Learning for Long-Horizon Manipulation

Pave the Way to Grasp Anything: Transferring Foundation Models for Universal Pick-Place Robots