Abstract: Utilizing Vision-Language Models (VLMs) for robotic manipulation represents a novel paradigm, aiming to enhance the model's ability to generalize to new objects and instructions. However, due to variations in camera specifications and mounting positions, existing methods exhibit significant performance disparities across different robotic platforms. To address this challenge, we propose RoboUniView in this paper, an innovative approach that decouples visual feature extraction from action learning. We first learn a unified view representation from multi-perspective views by pre-training on readily accessible data, and then derive actions from this unified view representation to control robotic manipulation. This unified view representation more accurately mirrors the physical world and is not constrained by the robotic platform's camera parameters. Thanks to this methodology, we achieve state-of-the-art performance on the demanding CALVIN benchmark, enhancing the success rate in the $D \to D$ setting from 93.0% to 96.2%, and in the $ABC \to D$ setting from 92.2% to 94.2%. Moreover, our model exhibits outstanding adaptability and flexibility: it maintains high performance under unseen camera parameters, can utilize multiple datasets with varying camera parameters, and is capable of joint cross-task learning across datasets. Code is provided for re-implementation: https://github.com/liufanfanlff/RoboUniview
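To illustrate why a representation indexed by world coordinates need not depend on any one camera, the following is a minimal sketch of standard pinhole projection, not the authors' implementation. The function name `project_points`, the intrinsics `K`, and the two example cameras are assumptions made for illustration. The same 3-D point lands on different pixels in each camera, yet its world coordinate, which is what a unified view would be keyed on, stays fixed.

```python
import numpy as np

def project_points(points_world, K, R, t):
    """Project N x 3 world points into one camera's pixel coordinates.

    Assumed convention (illustrative): x_cam = R @ x_world + t, with K the
    3 x 3 pinhole intrinsics. Returns (pixels N x 2, valid mask of length N),
    where valid marks points in front of the camera.
    """
    cam = points_world @ R.T + t           # world frame -> camera frame
    z = cam[:, 2]
    valid = z > 1e-6                       # keep only points with positive depth
    uvw = cam @ K.T                        # apply intrinsics
    safe_z = np.where(valid, z, 1.0)       # avoid division by zero for invalid points
    pixels = uvw[:, :2] / safe_z[:, None]  # perspective divide
    return pixels, valid

# One world point observed by two hypothetical cameras with different extrinsics.
point = np.array([[0.0, 0.0, 2.0]])
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)

pixels_a, valid_a = project_points(point, K, R, np.zeros(3))
pixels_b, valid_b = project_points(point, K, R, np.array([0.5, 0.0, 0.0]))
```

Under these assumptions, image features for a given world location can be looked up in each camera at its own projected pixel, so changing the camera changes only where to look, not what the representation is indexed by.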

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the large performance differences that arise when the same robotic manipulation method runs on different robotic platforms, caused by variations in camera specifications and mounting positions. Existing Vision-Language Models (VLMs) improve generalization to new objects and instructions, but because camera parameters differ between platforms, they struggle to build an accurate understanding of physical space, which degrades the accuracy of their action predictions.

To tackle this challenge, the paper proposes a method named **RoboUniView**. It decouples visual feature extraction from action learning: it first learns a unified view representation from multi-view images, then derives actions from that representation to control robotic manipulation. The unified view representation reflects the physical world more accurately and is not constrained by the camera parameters of the robotic platform.

### Main Contributions

1. **Proposed a Vision-Language Model with Unified View Representation** for robotic manipulation, improving performance and generalization under different robot camera parameters.
2. **Designed an effective pre-training method** to achieve a better understanding of the physical world.
3. **Conducted extensive experiments** to evaluate RoboUniView under various settings, showing clear advantages on the CALVIN benchmark: the success rate increased from 93.0% to 96.2% in the D→D setting and from 92.2% to 94.2% in the ABC→D setting.

### Experimental Results

- **Imitation Performance**: In the D→D setting, RoboUniView outperformed all other methods, increasing the Task 1 success rate from 0.930 to 0.962 and the average successful sequence length from 3.300 to 3.855.
- **Zero-Shot Generalization**: In the ABC→D setting, RoboUniView increased the Task 1 success rate from 0.922 to 0.942 and the average successful sequence length from 3.270 to 3.647.
- **Generalization Ability to Different Camera Parameters**: In advanced experiments such as D→Duc, Dmc→D, and Djtmc→D, RoboUniView demonstrated strong generalization ability, maintaining high success rates even with unseen camera parameters.

### Conclusion

By introducing a unified view representation and an effective pre-training method, RoboUniView improves the performance and generalization ability of robotic manipulation, especially under different camera parameters. This approach provides new directions for future research in robotic manipulation.
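The two kinds of numbers reported in the experimental results (a Task 1 success rate and an average successful sequence length) can be related with a short sketch. It assumes the usual CALVIN-style protocol in which each rollout attempts a chain of five consecutive tasks and stops at the first failure. The function name `calvin_style_metrics` and the sample rollout counts are hypothetical, not taken from the paper.

```python
from typing import List, Sequence, Tuple

def calvin_style_metrics(
    completed_counts: Sequence[int], chain_length: int = 5
) -> Tuple[List[float], float]:
    """Compute per-step success rates and the average sequence length.

    completed_counts[i] is the number of consecutive tasks completed in
    rollout i before the first failure (between 0 and chain_length).
    """
    n = len(completed_counts)
    if n == 0:
        raise ValueError("need at least one rollout")
    if any(c < 0 or c > chain_length for c in completed_counts):
        raise ValueError("counts must lie in [0, chain_length]")
    # success_at_k[k-1]: fraction of rollouts that completed at least k tasks
    success_at_k = [
        sum(c >= k for c in completed_counts) / n
        for k in range(1, chain_length + 1)
    ]
    # mean number of consecutive tasks completed per rollout
    average_length = sum(completed_counts) / n
    return success_at_k, average_length

# Hypothetical rollouts: four chains that completed 5, 3, 0 and 1 tasks.
rates, avg_len = calvin_style_metrics([5, 3, 0, 1])
```

In this formulation the average length equals the sum of the per-step success rates, which is why a gain in the Task 1 rate alone moves the average length only slightly, while gains at later steps accumulate.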

RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models

RoboLLM: Robotic Vision Tasks Grounded on Multimodal Large Language Models

Vision-Language Foundation Models as Effective Robot Imitators

VIEW: Visual Imitation Learning with Waypoints

HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks

TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation

ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models

Towards Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation: An Empirical Study

RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation

Calibration-Free Monocular Vision-Based Robot Manipulations With Occlusion Awareness

Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation

Vision-Based Efficient Robotic Manipulation with a Dual-Streaming Compact Convolutional Transformer

OVGNet: A Unified Visual-Linguistic Framework for Open-Vocabulary Robotic Grasping

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Robot Manipulation in Salient Vision through Referring Image Segmentation and Geometric Constraints

Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception

Learning Autonomous Viewpoint Adjustment from Human Demonstrations for Telemanipulation

QUAR-VLA: Vision-Language-Action Model for Quadruped Robots