Abstract: Pre-training visual representations has improved the efficiency of robot learning. Due to the lack of large-scale in-domain robotic datasets, prior works use in-the-wild human videos to pre-train robotic visual representations. Despite their promising results, representations learned from human videos are inevitably subject to distribution shift and lack the dynamics information crucial for task completion. We first evaluate various pre-trained representations in terms of their correlation to downstream robotic manipulation tasks (i.e., manipulation centricity). Interestingly, we find that manipulation centricity is a strong indicator of success rates on downstream tasks. Drawing on these findings, we propose Manipulation Centric Representation (MCR), a foundation representation learning framework that captures both visual features and the dynamics information of manipulation tasks, such as actions and proprioception, to improve manipulation centricity. Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions. We introduce a novel contrastive loss that aligns visual observations with the robot's proprioceptive state-action dynamics, combined with a behavior cloning (BC)-like actor loss that predicts actions during pre-training and a time contrastive loss. Empirical results across 4 simulation domains with 20 tasks verify that MCR outperforms the strongest baseline method by 14.8%. Moreover, MCR boosts the performance of data-efficient learning with a UR5e arm on 3 real-world tasks by 76.9%. Project website: https://robots-pretrain-robots.github.io/
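
The three-term objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an InfoNCE form for both contrastive terms, a mean-squared-error form for the BC-like actor loss, and equal default weights, none of which the abstract specifies. All function names, the temperature, and the weights (`info_nce`, `bc_loss`, `mcr_style_loss`, `w_dyn`, `w_bc`, `w_time`) are hypothetical, and embeddings are plain Python lists standing in for encoder outputs.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    # Scale to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(_dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i],
    with every other positives[j] in the batch acting as a negative."""
    a = [_normalize(x) for x in anchors]
    p = [_normalize(x) for x in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [_dot(ai, pj) / temperature for pj in p]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def bc_loss(pred_actions, expert_actions):
    """Mean squared error between predicted and recorded actions
    (an assumed form of the BC-like actor loss)."""
    sq, n = 0.0, 0
    for pred, target in zip(pred_actions, expert_actions):
        for x, y in zip(pred, target):
            sq += (x - y) ** 2
            n += 1
    return sq / n


def mcr_style_loss(img_emb, dyn_emb, nearby_img_emb,
                   pred_actions, expert_actions,
                   w_dyn=1.0, w_bc=1.0, w_time=1.0):
    """Weighted sum of the three terms named in the abstract.

    img_emb:        visual embeddings of the current frames
    dyn_emb:        embeddings of proprioceptive state-action chunks
    nearby_img_emb: visual embeddings of temporally close frames
                    (assumed positive pairs for the time contrastive term)
    """
    dynamics_term = info_nce(img_emb, dyn_emb)
    actor_term = bc_loss(pred_actions, expert_actions)
    time_term = info_nce(img_emb, nearby_img_emb)
    return w_dyn * dynamics_term + w_bc * actor_term + w_time * time_term
```

In this sketch, the dynamics term is small when each image embedding is closest to its own state-action embedding within the batch, which is one way to read "aligns visual observations with the robot's proprioceptive state-action dynamics".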

InterRep: A Visual Interaction Representation for Robotic Grasping

DexRepNet: Learning Dexterous Robotic Grasping Network with Geometric and Spatial Hand-Object Representations

Masked Visual-Tactile Pre-training for Robot Manipulation

Vision-Based Robotic Object Grasping—A Deep Reinforcement Learning Approach

Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets

Human-oriented Representation Learning for Robotic Manipulation

Learning to Regrasp Using Visual–Tactile Representation-Based Reinforcement Learning

Reinforcement Learning with Decoupled State Representation for Robot Manipulations

Towards Generalization and Data Efficient Learning of Deep Robotic Grasping

Active Perception and Representation for Robotic Manipulation

Learning Cross-hand Policies for High-DOF Reaching and Grasping

NeuralGrasps: Learning Implicit Representations for Grasps of Multiple Robotic Hands

Robotic Grasping Technology Integrating Large Kernel Convolution and Residual Connections

R3M: A Universal Visual Representation for Robot Manipulation

Acceleration of Actor-Critic Deep Reinforcement Learning for Visual Grasping in Clutter by State Representation Learning Based on Disentanglement of a Raw Input Image

Novel Representation of Robotic Grasp Detection based on Residual Hourglass Architecture

Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations

A Surprisingly Efficient Representation for Multi-Finger Grasping

Learning high-DOF reaching-and-grasping via dynamic representation of gripper-object interaction

Cross-Embodiment Dexterous Grasping with Reinforcement Learning

Deep Vision Networks for Real-Time Robotic Grasp Detection