Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning

Zhecheng Yuan, Tianming Wei, Shuiqi Cheng, Gu Zhang, Yuanpei Chen, Huazhe Xu
2024-10-23
Abstract: Can we endow visuomotor robots with generalization capabilities to operate in diverse open-world scenarios? In this paper, we propose Maniwhere, a generalizable framework tailored for visual reinforcement learning, enabling the trained robot policies to generalize across a combination of multiple visual disturbance types. Specifically, we introduce a multi-view representation learning approach fused with a Spatial Transformer Network (STN) module to capture shared semantic information and correspondences among different viewpoints. In addition, we employ a curriculum-based randomization and augmentation approach to stabilize the RL training process and strengthen the visual generalization ability. To exhibit the effectiveness of Maniwhere, we meticulously design 8 tasks encompassing articulated objects, bi-manual, and dexterous hand manipulation tasks, demonstrating Maniwhere's strong visual generalization and sim2real transfer abilities across 3 hardware platforms. Our experiments show that Maniwhere significantly outperforms existing state-of-the-art methods. Videos are provided at https://gemcollector.github.io/maniwhere/.
Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how visuomotor robots can generalize across complex, changing real-world environments so that they operate effectively in different visual scenes. Specifically, it focuses on training robot policies with reinforcement learning so that they cope with several types of visual disturbance (changes in camera viewpoint, appearance, lighting conditions, and so on) and transfer from simulation to the real world without recalibrating the camera.

### Main Issues

1. **Visual Generalization Capability**: Existing methods perform reasonably well under a single type of visual change but poorly under a combination of several. For example, the robot's performance drops significantly when the camera viewpoint, background color, or lighting conditions change.
2. **Sim-to-Real Transfer**: Transferring from simulation to the real world usually requires extensive recalibration and data collection, which is time-consuming and impractical in real applications.
3. **Robustness**: Existing methods tend to become unstable, or even diverge, during training when visual changes are introduced, so the learned policies cannot operate effectively in the real world.

### Solution

The paper proposes a framework called Maniwhere, which addresses these problems through the following components:

- **Multi-View Representation Learning**: Multi-view inputs and a contrastive learning objective are used to extract semantic information shared across viewpoints, making the model more robust to viewpoint changes.
- **Spatial Transformer Network (STN)**: An STN module integrated into the visual encoder further improves the model's adaptability to viewpoint changes.
- **Curriculum Randomization**: Curriculum randomization and data augmentation gradually increase the intensity of the randomization parameters, which stabilizes reinforcement learning training and improves visual generalization.

### Experimental Validation

The paper validates Maniwhere on 8 purpose-designed tasks covering single-arm manipulation, dual-arm collaboration, and dexterous hand manipulation. Experimental results show that Maniwhere significantly outperforms existing methods in both simulated and real-world environments, especially under combinations of multiple visual changes.

### Conclusion

Through multi-view representation learning, an STN-equipped encoder, and curriculum randomization, Maniwhere improves the generalization and sim-to-real transfer capabilities of visuomotor robots in complex, changing environments, laying groundwork for broader deployment of visuomotor systems.
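
As a rough illustration of the curriculum randomization idea described under Solution, the sketch below ramps a randomization bound from zero to its maximum over training. It is not Maniwhere's actual schedule, which this summary does not specify: the function names, the linear ramp, the warm-up phase, and all numeric defaults (`max_offset_deg`, `warmup_steps`, `ramp_steps`) are assumptions chosen only for illustration.

```python
import random


def randomization_scale(step, warmup_steps, ramp_steps):
    """Return a scale in [0, 1]: 0 during warm-up, then a linear ramp up to 1.

    Hypothetical schedule; the paper's own curriculum may differ.
    """
    if ramp_steps <= 0:
        raise ValueError("ramp_steps must be positive")
    if step < warmup_steps:
        return 0.0
    progress = (step - warmup_steps) / ramp_steps
    return min(1.0, progress)


def sample_camera_offset(step, max_offset_deg=30.0, warmup_steps=1000,
                         ramp_steps=4000, rng=random):
    """Sample a camera-angle perturbation whose range grows with training step.

    All parameter values are illustrative assumptions, not values from the paper.
    """
    scale = randomization_scale(step, warmup_steps, ramp_steps)
    bound = scale * max_offset_deg  # current half-width of the sampling range
    return rng.uniform(-bound, bound)


if __name__ == "__main__":
    for step in (0, 1000, 3000, 5000, 10000):
        scale = randomization_scale(step, 1000, 4000)
        print(step, scale, round(sample_camera_offset(step), 2))
```

Early in training the policy sees an unperturbed camera, which avoids destabilizing the RL updates; the perturbation range then widens until it reaches the full bound. The same scale could multiply other randomization parameters such as lighting or color jitter.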