Abstract: A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as practical generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks and enable seamless navigation for mixed long-horizon tasks in unseen real-world environments. Uni-NaVid achieves this by harmonizing the input and output data configurations for all commonly used embodied navigation tasks and thereby integrating all tasks in one model. For training Uni-NaVid, we collect 3.6 million navigation data samples in total from four essential navigation sub-tasks and foster synergy in learning across them. Extensive experiments on comprehensive navigation benchmarks clearly demonstrate the advantages of unification modeling in Uni-NaVid and show it achieves state-of-the-art performance. Additionally, real-world experiments confirm the model's effectiveness and efficiency, shedding light on its strong generalizability.

What problem does this paper attempt to address?

This paper aims to build a single, unified model for diverse embodied navigation tasks, enabling seamless navigation in unseen real-world environments. Specifically, it proposes the Uni-NaVid model, which integrates four common embodied navigation tasks: Vision-and-Language Navigation (VLN), Object Goal Navigation, Embodied Question Answering (EQA), and Human Following. These tasks require the model to understand natural-language instructions, recognize objects in the environment, answer questions about the environment, and track specific human targets. Uni-NaVid addresses these problems in the following ways:

1. **Unified task modeling**: Uni-NaVid integrates different navigation tasks into one model by unifying their input and output data configurations, so that a single model can handle multiple tasks.
2. **Efficient model design**: To improve efficiency, Uni-NaVid adopts an online visual token merging mechanism that compresses the visual information of historical frames while retaining fine-grained spatial information and structured temporal information. The model also uses forward-looking prediction to generate actions for multiple future steps at once, which supports asynchronous inference and execution.
3. **Cross-task collaborative learning**: Comprehensive experiments verify the synergy between different tasks and between simulated and real-world tasks, demonstrating the advantages of unified modeling.
4. **Large-scale dataset**: To train this large-scale model, the authors collected multi-task navigation data from multiple synthetic environments and combined it with real-world video question-answering data to improve the model's understanding of real-world images and support its open-vocabulary knowledge.

Through these methods, Uni-NaVid not only performs well on multiple benchmarks but also demonstrates its effectiveness and efficiency in real-world deployments.
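The online visual token merging described in item 2 could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the names `OnlineTokenMerger` and `pool_grid`, the window length, and the pooled grid sizes are all hypothetical, and plain average pooling is assumed as the merge operation. The idea it shows is that the newest frame keeps all of its tokens, a few recent frames are pooled to a coarser grid, and older frames collapse to a single token each, so the token count grows slowly as the video gets longer.

```python
from collections import deque
from typing import Deque, List, Optional

Token = List[float]        # one visual token (a feature vector)
Grid = List[List[Token]]   # tokens of one frame laid out as rows x cols


def pool_grid(grid: Grid, out_size: int) -> List[Token]:
    """Average-pool a square token grid down to out_size x out_size tokens."""
    n = len(grid)
    if n % out_size != 0:
        raise ValueError("grid side must be divisible by out_size")
    step = n // out_size
    dim = len(grid[0][0])
    pooled: List[Token] = []
    for r in range(0, n, step):
        for c in range(0, n, step):
            cell = [grid[i][j]
                    for i in range(r, r + step)
                    for j in range(c, c + step)]
            pooled.append([sum(t[d] for t in cell) / len(cell)
                           for d in range(dim)])
    return pooled


class OnlineTokenMerger:
    """Hypothetical merger: fine current frame, coarse recent frames,
    one token per older frame (all sizes are assumptions)."""

    def __init__(self, short_window: int = 2,
                 short_size: int = 2, long_size: int = 1) -> None:
        self.short_window = short_window
        self.short_size = short_size
        self.long_size = long_size
        self.current: Optional[Grid] = None
        self.short_term: Deque[Grid] = deque()
        self.long_term: List[Token] = []

    def add_frame(self, grid: Grid) -> None:
        # The previous current frame is demoted to short-term memory.
        if self.current is not None:
            self.short_term.append(self.current)
        # The oldest short-term frame is merged into long-term memory.
        if len(self.short_term) > self.short_window:
            oldest = self.short_term.popleft()
            self.long_term.extend(pool_grid(oldest, self.long_size))
        self.current = grid

    def tokens(self) -> List[Token]:
        """Return merged tokens in temporal order, oldest first."""
        out = list(self.long_term)
        for g in self.short_term:
            out.extend(pool_grid(g, self.short_size))
        if self.current is not None:
            out.extend(pool_grid(self.current, len(self.current)))
        return out
```

With 4x4 token grids and these default sizes, each frame costs 16 tokens while it is current, 4 tokens while it is in the short-term window, and 1 token afterwards. Because older frames are merged as new ones arrive, nothing has to be recomputed over the full history at each step.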

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks
