Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

Jiazhao Zhang, Kunyu Wang, Shaoan Wang, Minghan Li, Haoran Liu, Songlin Wei, Zhongyuan Wang, Zhizheng Zhang, He Wang
2024-12-09
Abstract: A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as practical generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments. Uni-NaVid achieves this by harmonizing the input and output data configurations for all commonly used embodied navigation tasks and thereby integrating all tasks in one model. For training Uni-NaVid, we collect 3.6 million navigation data samples in total from four essential navigation sub-tasks and foster synergy in learning across them. Extensive experiments on comprehensive navigation benchmarks clearly demonstrate the advantages of unification modeling in Uni-NaVid and show it achieves state-of-the-art performance. Additionally, real-world experiments confirm the model's effectiveness and efficiency, shedding light on its strong generalizability.
Robotics, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to unify diverse embodied navigation tasks in a single model, so that an agent can navigate seamlessly in unseen real-world environments. Specifically, it proposes the Uni-NaVid model, which integrates four common embodied navigation tasks: Vision-and-Language Navigation (VLN), Object Goal Navigation, Embodied Question Answering (EQA), and Human Following. Together, these tasks require the model to understand natural-language instructions, recognize objects in the environment, answer questions about the environment, and track specific human targets. Uni-NaVid addresses these requirements in the following ways:

1. **Unified task modeling**: Uni-NaVid integrates the different navigation tasks into one model by unifying their input and output data configurations, so that a single model handles all of them.
2. **Efficient model design**: To improve efficiency, Uni-NaVid adopts an online visual token merging mechanism that compresses the visual information of historical frames while retaining fine-grained spatial information and structured temporal information. The model also uses forward-looking prediction to generate actions for multiple future steps at once, which supports asynchronous inference and execution.
3. **Cross-task collaborative learning**: Comprehensive experiments verify the synergy between different tasks and the synergy between simulated and real-world tasks, demonstrating the advantages of unified modeling.
4. **Large-scale dataset**: To train the model, the authors collected multi-task navigation data from multiple synthetic environments and combined it with real-world video question-answering data to strengthen the model's understanding of real-world images and support its open-vocabulary knowledge.
Through these methods, Uni-NaVid not only performs well on multiple benchmarks but also demonstrates its effectiveness and efficiency in real-world deployments.
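To make the two efficiency ideas above more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that "token merging" can be approximated by average-pooling the tokens of older frames more aggressively than recent ones, and that "forward-looking prediction" means the model returns a fixed-length chunk of actions that the robot consumes one at a time. All names (`mean_pool`, `OnlineTokenMemory`, `ActionBuffer`), the window sizes, the token counts, and the action labels are hypothetical choices for illustration only.

```python
from collections import deque
from typing import Deque, List, Sequence

Token = List[float]  # one visual token as a plain feature vector (assumption)


def mean_pool(tokens: Sequence[Token], groups: int) -> List[Token]:
    """Merge tokens into `groups` tokens by averaging contiguous chunks."""
    if not 1 <= groups <= len(tokens):
        raise ValueError("groups must be between 1 and the number of tokens")
    size, extra = divmod(len(tokens), groups)
    merged, start = [], 0
    for g in range(groups):
        end = start + size + (1 if g < extra else 0)
        chunk = tokens[start:end]
        # Average each feature dimension across the tokens in this chunk.
        merged.append([sum(col) / len(chunk) for col in zip(*chunk)])
        start = end
    return merged


class OnlineTokenMemory:
    """Hypothetical memory: newest frame kept whole, older frames merged more."""

    def __init__(self, recent_frames: int = 2, recent_tokens: int = 4,
                 old_tokens: int = 1) -> None:
        self.recent_frames = recent_frames  # frames kept at medium detail
        self.recent_tokens = recent_tokens  # tokens per recent frame
        self.old_tokens = old_tokens        # tokens per old frame
        self.current: List[Token] = []
        self.recent: Deque[List[Token]] = deque()
        self.old: List[List[Token]] = []

    def add_frame(self, tokens: Sequence[Token]) -> None:
        # The previous newest frame is lightly merged into the recent window.
        if self.current:
            k = min(self.recent_tokens, len(self.current))
            self.recent.append(mean_pool(self.current, k))
        # Frames that leave the recent window are merged further, once.
        while len(self.recent) > self.recent_frames:
            frame = self.recent.popleft()
            self.old.append(mean_pool(frame, min(self.old_tokens, len(frame))))
        self.current = [list(t) for t in tokens]

    def context(self) -> List[Token]:
        """Return all tokens in temporal order, oldest first."""
        out: List[Token] = []
        for frame in self.old:
            out.extend(frame)
        for frame in self.recent:
            out.extend(frame)
        out.extend(self.current)
        return out


class ActionBuffer:
    """Single-threaded stand-in for executing a predicted chunk of actions."""

    def __init__(self, horizon: int = 4) -> None:
        self.horizon = horizon  # actions per prediction (assumed value)
        self.queue: Deque[str] = deque()

    def needs_prediction(self) -> bool:
        return not self.queue

    def push_chunk(self, actions: Sequence[str]) -> None:
        if len(actions) != self.horizon:
            raise ValueError("expected exactly `horizon` actions")
        self.queue.extend(actions)

    def next_action(self) -> str:
        return self.queue.popleft()
```

Because each frame is merged only when it ages out of a window, the cost per new frame stays constant, and the context grows by `old_tokens` per frame instead of by the full per-frame token count. In a real deployment the action buffer would be refilled by a separate inference thread or process while the robot executes the queued actions; the class above only shows the bookkeeping.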