Abstract: The pre-training of visual representations has enhanced the efficiency of robot learning. Due to the lack of large-scale in-domain robotic datasets, prior works utilize in-the-wild human videos to pre-train robotic visual representations. Despite their promising results, representations from human videos are inevitably subject to distribution shifts and lack the dynamics information crucial for task completion. We first evaluate various pre-trained representations in terms of their correlation to the downstream robotic manipulation tasks (i.e., manipulation centricity). Interestingly, we find that manipulation centricity is a strong indicator of success rates when applied to downstream tasks. Drawing from these findings, we propose Manipulation Centric Representation (MCR), a foundation representation learning framework that captures both visual features and the dynamics information of manipulation tasks, such as actions and proprioceptive states, to improve manipulation centricity. Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions. We introduce a novel contrastive loss that aligns visual observations with the robot's proprioceptive state-action dynamics, and combine it with a behavior cloning (BC)-like actor loss that predicts actions during pre-training and with a time contrastive loss. Empirical results across 4 simulation domains with 20 tasks verify that MCR outperforms the strongest baseline method by 14.8%. Moreover, MCR boosts the performance of data-efficient learning with a UR5e arm on 3 real-world tasks by 76.9%. Project website: https://robots-pretrain-robots.github.io/
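
The three-part pre-training objective described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the InfoNCE form of both contrastive terms, the mean-squared-error actor loss, the equal default weights, and every function and parameter name below are assumptions chosen for illustration only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: anchors[i] should match positives[i];
    every other row in the batch serves as a negative."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]
    return total / len(anchors)


def bc_loss(pred_actions, expert_actions):
    """Mean squared error between predicted and demonstrated actions."""
    count = sum(len(a) for a in expert_actions)
    squared = sum(
        (p - e) ** 2
        for pa, ea in zip(pred_actions, expert_actions)
        for p, e in zip(pa, ea)
    )
    return squared / count


def combined_loss(img_emb, dyn_emb, nearby_img_emb,
                  pred_actions, expert_actions,
                  weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms (weights are hypothetical).

    img_emb:        visual embeddings of the current frames
    dyn_emb:        embeddings of proprioceptive state-action segments
    nearby_img_emb: visual embeddings of temporally nearby frames
    """
    dynamics_term = info_nce(img_emb, dyn_emb)         # vision <-> dynamics
    actor_term = bc_loss(pred_actions, expert_actions)  # BC-like actor
    temporal_term = info_nce(img_emb, nearby_img_emb)   # time contrastive
    w_dyn, w_act, w_time = weights
    return w_dyn * dynamics_term + w_act * actor_term + w_time * temporal_term
```

In this sketch, each contrastive term is small when an image embedding is closest to its own paired embedding within the batch, and the actor term is zero when predicted actions match the demonstrations exactly. A real pipeline would compute these terms on encoder outputs inside an automatic-differentiation framework rather than on Python lists.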

Masked Visual-Tactile Pre-training for Robot Manipulation

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

InterRep: A Visual Interaction Representation for Robotic Grasping

VITaL Pretraining: Visuo-Tactile Pretraining for Tactile and Non-Tactile Manipulation Policies

Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation

Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods

Canonical Representation and Force-Based Pretraining of 3D Tactile for Dexterous Visuo-Tactile Policy Learning

Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets

What Makes Pre-Trained Visual Representations Successful for Robust Manipulation?

Robot Synesthesia: In-Hand Manipulation with Visuotactile Sensing

Low Fidelity Visuo-Tactile Pretraining Improves Vision-Only Manipulation Performance

TacGNN: Learning Tactile-based In-hand Manipulation with a Blind Robot

Self-Supervised Visuo-Tactile Pretraining to Locate and Follow Garment Features

Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations

MimicTouch: Leveraging Multi-modal Human Tactile Demonstrations for Contact-rich Manipulation

Visual Robotic Manipulation with Depth-Aware Pretraining

Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation

3D-ViTac: Learning Fine-Grained Manipulation with Visuo-Tactile Sensing

Dexterity from Touch: Self-Supervised Pre-Training of Tactile Representations with Robotic Play

Learning Self-Supervised Representations from Vision and Touch for Active Sliding Perception of Deformable Surfaces

The Power of the Senses: Generalizable Manipulation from Vision and Touch through Masked Multimodal Learning