Abstract: Pre-training visual representations has improved the efficiency of robot learning. Because large-scale in-domain robotic datasets have been lacking, prior work uses in-the-wild human videos to pre-train robotic visual representations. Despite promising results, representations learned from human videos are inevitably subject to distribution shift and lack the dynamics information crucial for task completion. We first evaluate various pre-trained representations in terms of their correlation with downstream robotic manipulation tasks (i.e., manipulation centricity). Interestingly, we find that manipulation centricity is a strong indicator of success rates on downstream tasks. Drawing on these findings, we propose Manipulation Centric Representation (MCR), a foundation representation learning framework that captures both visual features and the dynamics information of manipulation tasks, such as actions and proprioceptive states, to improve manipulation centricity. Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions. We introduce a novel contrastive loss that aligns visual observations with the robot's proprioceptive state-action dynamics, and combine it with a behavior cloning (BC)-like actor loss that predicts actions during pre-training and with a time contrastive loss. Empirical results across 4 simulation domains with 20 tasks verify that MCR outperforms the strongest baseline method by 14.8%. Moreover, MCR boosts the performance of data-efficient learning with a UR5e arm on 3 real-world tasks by 76.9%. Project website: https://robots-pretrain-robots.github.io/
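
The three-term objective described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes an InfoNCE form for both contrastive terms, a mean-squared-error form for the BC-like actor loss, and equal loss weights, none of which the abstract specifies. All function names, parameter names, and the temperature value are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Guard against a zero vector so the division is always defined.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over a batch: anchors[i] should match positives[i];
    every other row of `positives` acts as a negative (assumed form)."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def bc_loss(pred_actions, expert_actions):
    """Mean squared error between predicted and demonstrated actions
    (an assumed stand-in for the 'BC-like actor loss')."""
    count = sum(len(e) for e in expert_actions)
    squared = sum(
        (p - e) ** 2
        for pa, ea in zip(pred_actions, expert_actions)
        for p, e in zip(pa, ea)
    )
    return squared / count


def pretraining_loss(img_emb, dyn_emb, future_img_emb,
                     pred_actions, expert_actions,
                     w_dyn=1.0, w_bc=1.0, w_time=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    # Align each image embedding with its state-action dynamics embedding.
    dynamics_term = info_nce(img_emb, dyn_emb)
    # Simplified time-contrastive term: a later frame of the same
    # trajectory is the positive, other batch items are negatives.
    time_term = info_nce(img_emb, future_img_emb)
    actor_term = bc_loss(pred_actions, expert_actions)
    return w_dyn * dynamics_term + w_bc * actor_term + w_time * time_term
```

In this sketch, image embeddings paired with their own dynamics embeddings yield a lower contrastive term than mismatched pairs, which is the alignment behavior the abstract attributes to the dynamics loss.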

On Pre-Training for Visuo-Motor Control: Revisiting a Learning-from-Scratch Baseline

Masked Visual-Tactile Pre-training for Robot Manipulation

Mastering Robot Control Through Point-based Reinforcement Learning with Pre-training

The Unsurprising Effectiveness of Pre-Trained Vision Models for Control

For Pre-Trained Vision Models in Motor Control, Not All Policy Learning Methods are Created Equal

Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods

An Unbiased Look at Datasets for Visuo-Motor Pre-Training

Learning Visual Robotic Control Efficiently with Contrastive Pre-training and Data Augmentation

VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving

GPPF: A General Perception Pre-training Framework via Sparsely Activated Multi-Task Learning

Visual Robotic Manipulation with Depth-Aware Pretraining

Autoregressive Pretraining with Mamba in Vision

Investigating Pre-Training Objectives for Generalization in Vision-Based Reinforcement Learning

Rethinking Overlooked Aspects in Vision-Language Models

Contrastive-Adversarial and Diffusion: Exploring pre-training and fine-tuning strategies for sulcal identification

Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets

Real-time Vision-Language-Navigation based on a Lite Pre-training Model

Spatiotemporal Predictive Pre-training for Robotic Motor Control

Improved Baselines with Visual Instruction Tuning

Are we pretraining it right? Digging deeper into visio-linguistic pretraining

VANP: Learning Where to See for Navigation with Self-Supervised Vision-Action Pre-Training