Abstract: Learning generalizable visual dynamic representation across different embodied environments is crucial for real-world robotic manipulation. As the scale and diversity of robot demonstration data are limited, recent works have turned to large-scale pre-training using human data. However, the morphological differences between humans and robots introduce a significant human-robot domain discrepancy, challenging the generalization of these human-data pre-trained models to downstream manipulation tasks. To address this, we propose a novel adaptation paradigm that utilizes readily available paired human-robot video data to bridge the discrepancy. Following this paradigm, our method exploits a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robotic domain in a parameter-efficient manner. The experiments demonstrate significant improvements on 25 tasks across three different benchmarks, covering both single-task and language-conditioned multi-task settings and two different pre-trained models. On the large RLBench benchmark, our adaptation method achieves an average improvement of $8.9\%$ in success rate over the pre-trained R3M model across multiple tasks. We will release the code and models upon acceptance.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the issue of human-robot domain discrepancy in visual pre-training. Specifically, although pre-trained models on large-scale human data have potential applications in robotic manipulation tasks, these pre-trained models are difficult to directly apply to downstream robotic manipulation tasks due to significant morphological differences between humans and robots. This discrepancy leads to insufficient generalization ability of pre-trained models in practical applications.

### Solution

To solve this problem, the authors propose a new adaptation paradigm that leverages paired human-robot video data to bridge the domain gap. The specific methods include:

1. **Human-Robot Contrastive Alignment Loss**: By introducing a human-robot contrastive alignment loss, the pre-trained model can better align the semantics of human and robot videos.
2. **Parameter-Efficient Adapter Modules**: Inserting learnable adapter modules into the pre-trained model to efficiently adapt to the dynamic semantics of the robotic domain.
3. **Task-Aware Feature Modeling**: Using task descriptions as queries to extract task-relevant semantic information from video features, further enhancing feature representation.

### Experimental Validation

To validate the effectiveness of this method, the authors conducted the following experiments:

1. **Single Task Setting**: Evaluated the performance of two adapted pre-trained models (D4R-Align and R3M-Align) on multiple tasks in the Adroit and Metaworld environments. The results showed that the adapted models achieved significant performance improvements across all tasks.
2. **Multi-Task Setting**: Evaluated the performance of the adapted pre-trained models on 18 language-conditioned tasks in the RLBench benchmark. The results indicated that the adapted models significantly increased the success rate across multiple tasks, with the R3M-Align model achieving an average success rate improvement of 8.9% in the challenging multi-task setting.

### Main Contributions

1. **Proposed a New Adaptation Paradigm**: Effectively mitigated the human-robot domain discrepancy by leveraging paired human-robot video data while maintaining the generality of the pre-trained model.
2. **Designed an Efficient Human-Robot Semantic Alignment Method**: Achieved effective adaptation of the pre-trained model through parameter-efficient adapter modules and contrastive alignment loss.
3. **Extensive Experimental Validation**: Validated the effectiveness of the method across multiple environments and tasks, demonstrating its generalization ability in different tasks.

In summary, this paper proposes a new adaptation paradigm that effectively addresses the human-robot domain discrepancy in visual pre-training, providing a new solution for the generalization ability of robotic manipulation tasks.
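
To make the human-robot contrastive alignment idea from the summary more concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over paired human and robot video features. It is a generic formulation, not necessarily the exact loss used in the paper. The function names, the plain-list feature representation, and the temperature of 0.1 are assumptions made for illustration. In a setup like the one described, such a loss would presumably be minimized with respect to the adapter parameters only, which this sketch does not model.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax_at(logits, index):
    """Numerically stable log-softmax of `logits`, evaluated at `index`."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[index] - log_sum


def contrastive_alignment_loss(human_feats, robot_feats, temperature=0.1):
    """Symmetric InfoNCE loss (hypothetical helper, not the paper's code).

    human_feats[i] and robot_feats[i] are features of a paired human and
    robot video; every other robot/human video in the batch is a negative.
    """
    h = [l2_normalize(v) for v in human_feats]
    r = [l2_normalize(v) for v in robot_feats]
    n = len(h)
    # Cosine similarity between every human and robot clip, scaled by temperature.
    sim = [[dot(h[i], r[j]) / temperature for j in range(n)] for i in range(n)]
    # Human-to-robot direction: each row should peak on its diagonal entry.
    h2r = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Robot-to-human direction: each column should peak on its diagonal entry.
    r2h = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (h2r + r2h)


# Toy example: two paired clips with 2-D features (values are made up).
human = [[1.0, 0.0], [0.0, 1.0]]
robot_aligned = [[1.0, 0.0], [0.0, 1.0]]
robot_swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = contrastive_alignment_loss(human, robot_aligned)
loss_swapped = contrastive_alignment_loss(human, robot_swapped)
```

In this toy case the loss is close to zero when each human clip matches its paired robot clip and large when the pairs are swapped, which is the behavior an alignment objective of this kind is meant to reward.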

Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

Masked Visual-Tactile Pre-training for Robot Manipulation

Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets

HRP: Human Affordances for Robotic Pre-Training

Visual Robotic Manipulation with Depth-Aware Pretraining

Human-oriented Representation Learning for Robotic Manipulation

Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods

Learning Visual Robotic Control Efficiently with Contrastive Pre-training and Data Augmentation

Benchmarking Adaptive Intelligence and Computer Vision on Human-Robot Collaboration

R3M: A Universal Visual Representation for Robot Manipulation

Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations

VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation

Learning Manipulation by Predicting Interaction

3D-MVP: 3D Multiview Pretraining for Robotic Manipulation

Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers

What Makes Pre-Trained Visual Representations Successful for Robust Manipulation?

Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations

Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning

Learning Autonomous Viewpoint Adjustment from Human Demonstrations for Telemanipulation

Transferring Foundation Models for Generalizable Robotic Manipulation