RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation

Fanfan Liu, Feng Yan, Liming Zheng, Chengjian Feng, Yiyang Huang, Lin Ma
2024-09-12
Abstract: Utilizing Vision-Language Models (VLMs) for robotic manipulation represents a novel paradigm, aiming to enhance the model's ability to generalize to new objects and instructions. However, due to variations in camera specifications and mounting positions, existing methods exhibit significant performance disparities across different robotic platforms. To address this challenge, we propose RoboUniView in this paper, an innovative approach that decouples visual feature extraction from action learning. We first learn a unified view representation from multi-perspective views by pre-training on readily accessible data, and then derive actions from this unified view representation to control robotic manipulation. This unified view representation more accurately mirrors the physical world and is not constrained by the robotic platform's camera parameters. Thanks to this methodology, we achieve state-of-the-art performance on the demanding CALVIN benchmark, enhancing the success rate in the $D \to D$ setting from 93.0% to 96.2%, and in the $ABC \to D$ setting from 92.2% to 94.2%. Moreover, our model exhibits outstanding adaptability and flexibility: it maintains high performance under unseen camera parameters, can utilize multiple datasets with varying camera parameters, and is capable of joint cross-task learning across datasets. Code is provided for re-implementation: https://github.com/liufanfanlff/RoboUniview
Robotics, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the large performance gaps that appear when the same manipulation method is run on different robotic platforms, gaps caused by variations in camera specifications and mounting positions. Specifically, while existing Vision-Language Models (VLMs) improve generalization to new objects and instructions, they struggle to understand the real physical space accurately when camera parameters differ across platforms, which degrades the accuracy of their action predictions.

To tackle this challenge, the paper proposes a method named **RoboUniView**. It decouples visual feature extraction from action learning: it first learns a unified view representation from multi-view images, then derives actions from that representation to control robotic manipulation. This unified view representation reflects the physical world more accurately and is not constrained by the camera parameters of the robotic platform.

### Main Contributions

1. **Proposed a Vision-Language Model with Unified View Representation** for robotic manipulation, improving the model's performance and generalization under different robot camera parameters.
2. **Designed an effective pre-training method** to achieve a better understanding of the physical world.
3. **Conducted extensive experiments** to evaluate RoboUniView under various settings, showing clear advantages on the CALVIN benchmark: the success rate increased from 93.0% to 96.2% in the D→D setting and from 92.2% to 94.2% in the ABC→D setting.

### Experimental Results

- **Imitation Performance**: In the D→D setting, RoboUniView outperformed all other methods, increasing the success rate of Task 1 from 0.930 to 0.962 and the average successful sequence length from 3.300 to 3.855.
- **Zero-Shot Generalization**: In the ABC→D setting, RoboUniView increased the success rate of Task 1 from 0.922 to 0.942 and the average successful sequence length from 3.270 to 3.647.
- **Generalization Ability to Different Camera Parameters**: In advanced experiments such as D→Duc, Dmc→D, and Djtmc→D, RoboUniView demonstrated strong generalization ability, maintaining high success rates even with unseen camera parameters.

### Conclusion

By introducing a unified view representation and an effective pre-training method, RoboUniView improves the performance and generalization ability of robotic manipulation, especially under different camera parameters. This approach provides new directions for future research in robotic manipulation.
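The two figures quoted in the experimental results (Task 1 success rate and average successful sequence length) are related, and a small sketch can make that relationship concrete. The code below is illustrative only: it assumes chains of five instructions, as in the standard CALVIN long-horizon evaluation, and the function name `calvin_metrics` and the rollout counts are hypothetical, not taken from the paper or its repository.

```python
from typing import List, Sequence, Tuple


def calvin_metrics(
    completed_counts: Sequence[int], chain_len: int = 5
) -> Tuple[List[float], float]:
    """Compute per-position success rates and average sequence length.

    completed_counts[i] is the number of consecutive tasks the policy
    completed in rollout i before its first failure (0..chain_len).
    """
    if not completed_counts:
        raise ValueError("need at least one rollout")
    if any(c < 0 or c > chain_len for c in completed_counts):
        raise ValueError("counts must lie in [0, chain_len]")

    n = len(completed_counts)
    # Success rate for "Task k": fraction of rollouts that completed
    # at least k tasks in a row.
    success_rates = [
        sum(c >= k for c in completed_counts) / n
        for k in range(1, chain_len + 1)
    ]
    # Average sequence length: mean number of consecutively completed tasks.
    avg_len = sum(completed_counts) / n
    return success_rates, avg_len


# Hypothetical rollout outcomes, for illustration only.
rollouts = [5, 3, 0, 5, 1, 4, 2, 5]
rates, avg_len = calvin_metrics(rollouts)
print("Task 1 success rate:", rates[0])
print("Average sequence length:", avg_len)
```

Under this definition the average sequence length equals the sum of the five per-position success rates, which is why a model can gain only a few points on Task 1 yet gain noticeably more in average length when it fails less often later in the chain.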