Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation

Jiaming Zhou, Teli Ma, Kun-Yu Lin, Ronghe Qiu, Zifan Wang, Junwei Liang
2024-06-20
Abstract: Learning generalizable visual dynamic representation across different embodied environments is crucial for real-world robotic manipulation. As the scale and diversity of robot demonstration data are limited, recent works have turned to large-scale pre-training using human data. However, the morphological differences between humans and robots introduce a significant human-robot domain discrepancy, challenging the generalization of these human-data pre-trained models to downstream manipulation tasks. To address this, we propose a novel adaptation paradigm that utilizes readily available paired human-robot video data to bridge the discrepancy. Following this paradigm, our method exploits a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robotic domain in a parameter-efficient manner. The experiments demonstrate significant improvements on 25 tasks across three different benchmarks, where the single-task, language-conditioned multi-task settings are covered, and two different pre-trained models are evaluated. On the large RLBench benchmark, our adaptation method achieves an average improvement of $8.9\%$ in success rate over the pre-trained R3M model across multiple tasks. We will release the code and models upon acceptance.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the human-robot domain discrepancy in visual pre-training. Models pre-trained on large-scale human data are promising for robotic manipulation, but the significant morphological differences between humans and robots make them difficult to apply directly to downstream manipulation tasks. As a result, these pre-trained models generalize poorly in practice.

### Solution

To solve this problem, the authors propose a new adaptation paradigm that leverages paired human-robot video data to bridge the domain gap. The method has three components:

1. **Human-Robot Contrastive Alignment Loss**: A contrastive loss aligns the semantics of human videos with those of the paired robot videos in the pre-trained model's feature space.
2. **Parameter-Efficient Adapter Modules**: Learnable adapter modules are inserted into the pre-trained model so that it can adapt efficiently to the dynamic semantics of the robotic domain.
3. **Task-Aware Feature Modeling**: Task descriptions are used as queries to extract task-relevant semantic information from video features, further enhancing the feature representation.

### Experimental Validation

To validate the effectiveness of the method, the authors conducted the following experiments:

1. **Single-Task Setting**: Two adapted pre-trained models (D4R-Align and R3M-Align) were evaluated on multiple tasks in the Adroit and Metaworld environments. The adapted models achieved significant performance improvements across all tasks.
2. **Multi-Task Setting**: The adapted pre-trained models were evaluated on 18 language-conditioned tasks in the RLBench benchmark. The adapted models significantly increased the success rate across multiple tasks, with R3M-Align achieving an average success rate improvement of 8.9% over the pre-trained R3M model in this challenging multi-task setting.

### Main Contributions

1. **Proposed a New Adaptation Paradigm**: Mitigates the human-robot domain discrepancy by leveraging paired human-robot video data while maintaining the generality of the pre-trained model.
2. **Designed an Efficient Human-Robot Semantic Alignment Method**: Adapts the pre-trained model through parameter-efficient adapter modules and a contrastive alignment loss.
3. **Extensive Experimental Validation**: Validates the method across multiple environments and tasks, demonstrating that it generalizes across different tasks.

In summary, this paper proposes a new adaptation paradigm that mitigates the human-robot domain discrepancy in visual pre-training, offering a new way to improve generalization in robotic manipulation tasks.
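
To make the human-robot contrastive alignment idea more concrete, the following is a minimal sketch, not the authors' implementation. The summary does not give the exact form of the loss, so this assumes a symmetric InfoNCE-style objective over cosine similarities, where `human_feats[i]` and `robot_feats[i]` are features of a paired human and robot video of the same task. The function names, the temperature value, and the toy feature vectors are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy of matching anchors[i] to positives[i] among all positives."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Similarity of anchor i to every candidate; index i is the true pair.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def human_robot_alignment_loss(human_feats, robot_feats, temperature=0.1):
    """Assumed symmetric form: human-to-robot plus robot-to-human InfoNCE, averaged."""
    return 0.5 * (
        info_nce(human_feats, robot_feats, temperature)
        + info_nce(robot_feats, human_feats, temperature)
    )


# Toy features (hypothetical): row i of each list comes from the same task.
human = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
robot_aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# Rotating the rows breaks the pairing, so every "positive" is a mismatch.
robot_shuffled = robot_aligned[1:] + robot_aligned[:1]

print(human_robot_alignment_loss(human, robot_aligned))
print(human_robot_alignment_loss(human, robot_shuffled))
```

On the toy data, the correctly paired features yield a much lower loss than the shuffled pairing, which is the signal such an objective would use to pull paired human and robot video features together. In the setting the summary describes, only the inserted adapter parameters would receive gradients from this loss while the pre-trained backbone stays frozen; that training loop is omitted here.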