Abstract: Visual Language Navigation is a task that challenges robots to navigate in realistic environments based on natural language instructions. While previous research has largely focused on static settings, real-world navigation must often contend with dynamic human obstacles. Hence, we propose an extension to the task, termed Adaptive Visual Language Navigation (AdaVLN), which seeks to narrow this gap. AdaVLN requires robots to navigate complex 3D indoor environments populated with dynamically moving human obstacles, adding a layer of complexity that mirrors real-world conditions. To support exploration of this task, we also present the AdaVLN simulator and the AdaR2R datasets. The AdaVLN simulator enables easy inclusion of fully animated human models directly into common datasets like Matterport3D. We also introduce a "freeze-time" mechanism for both the navigation task and the simulator, which pauses world state updates during agent inference, enabling fair comparisons and experimental reproducibility across different hardware. We evaluate several baseline models on this task, analyze the unique challenges introduced by AdaVLN, and demonstrate its potential to bridge the sim-to-real gap in VLN research.
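The "freeze-time" idea in the abstract can be illustrated with a minimal sketch. This is not the AdaVLN simulator's code: the `World` class, `slow_policy`, and `run_episode` are hypothetical names, and the toy world contains a single human moving along one axis. The point it shows is that simulated time advances only while an action executes, so a slower inference step does not change where the human ends up.

```python
import time
from dataclasses import dataclass


@dataclass
class World:
    """Toy world (hypothetical): one human walking along x at constant speed."""
    sim_time: float = 0.0
    human_x: float = 0.0
    human_speed: float = 1.0  # metres per simulated second

    def step(self, dt: float) -> None:
        # Advance simulated time and move the human accordingly.
        self.sim_time += dt
        self.human_x += self.human_speed * dt


def slow_policy(observation: float, delay_s: float) -> float:
    """Stand-in for model inference; sleeping emulates slower hardware."""
    time.sleep(delay_s)
    return 0.5  # action duration in simulated seconds (fixed for the sketch)


def run_episode(inference_delay_s: float, n_decisions: int = 3) -> float:
    world = World()
    for _ in range(n_decisions):
        # Freeze-time: world.step() is never called while the policy runs,
        # so wall-clock inference latency does not move the human.
        action_duration = slow_policy(world.human_x, inference_delay_s)
        # The world is updated only for the duration of the chosen action.
        world.step(action_duration)
    return world.human_x


if __name__ == "__main__":
    # Same final human position regardless of how long inference takes.
    print(run_episode(0.0), run_episode(0.02))
```

Without the freeze, the human's displacement would depend on wall-clock inference latency, which is what makes results hard to compare across machines.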

What problem does this paper attempt to address?

The paper addresses the fact that existing Visual Language Navigation (VLN) tasks are set in static environments and therefore do not reflect the challenges that dynamic obstacles, such as moving humans, pose for robot navigation in the real world. To bridge this gap, the authors propose Adaptive Visual Language Navigation (AdaVLN), an extended VLN task in which robots navigate complex 3D indoor environments containing dynamic human obstacles.

### Specific Problem Description

1. **Limitations of Static Environments**:
   - Existing VLN tasks mainly focus on static environments and ignore dynamic elements of the real world, such as moving humans and other obstacles.
   - In real-world scenarios, robots must handle constantly changing environments, including moving objects and people, which requires real-time prediction and obstacle avoidance.
2. **Introduction of Dynamic Obstacles**:
   - The AdaVLN task requires robots not only to understand natural language instructions and navigate to target locations, but also to avoid collisions with both static and dynamic obstacles.
   - The states (positions and directions) of dynamic obstacles change over time, which increases the complexity of the task.
3. **Improvements in Simulators and Datasets**:
   - To support research on the AdaVLN task, the authors introduce the AdaSimulator and the AdaR2R dataset.
   - The AdaSimulator is based on IsaacSim and supports a physics engine, animated human models, and precise robot motion.
   - The AdaR2R dataset adds the paths and configurations of dynamic obstacles to the existing Matterport3D environments.
4. **Experimental Evaluation**:
   - The paper evaluates several baseline models on the AdaVLN task and analyzes the impact of dynamic obstacles on navigation performance.
   - It examines collisions between the robot and both the environment and the human obstacles, with qualitative and quantitative analyses.

### Summary

The main contribution of the paper is AdaVLN, a Visual Language Navigation task that is closer to the real world. By introducing dynamic obstacles and an improved simulator, it allows robots to be evaluated in more complex environments. This opens new challenges and opportunities for future research, especially for robot navigation in practical applications.
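The dynamic-obstacle setup described above (human states that change over time, and collisions counted against them) can be sketched as follows. This is only an illustration under stated assumptions: the waypoint-path format, the constant walking speed, the circular footprints, and the names `human_state` and `collides` are hypothetical and are not taken from AdaR2R or the AdaSimulator.

```python
import math


def human_state(waypoints, speed, t):
    """Return (x, y, heading) of a human walking the waypoint path at time t.

    Assumes at least two waypoints, distinct consecutive waypoints, and a
    constant speed in metres per second (all hypothetical choices).
    """
    remaining = speed * t  # distance walked so far
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        heading = math.atan2(y1 - y0, x1 - x0)
        if remaining <= seg:
            f = remaining / seg  # fraction of the current segment covered
            return (x0 + f * (x1 - x0), y0 + f * (y1 - y0), heading)
        remaining -= seg
    # Past the end of the path: stay at the last waypoint, keep last heading.
    return (x1, y1, heading)


def collides(robot_xy, human_xy, robot_radius=0.25, human_radius=0.25):
    """Circle-circle overlap test; the radii are illustrative values."""
    dist = math.hypot(robot_xy[0] - human_xy[0], robot_xy[1] - human_xy[1])
    return dist < robot_radius + human_radius


if __name__ == "__main__":
    path = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0)]
    robot = (2.0, 0.2)  # a stationary robot, for illustration only
    for t in (0.0, 2.0, 5.0):
        x, y, heading = human_state(path, speed=1.0, t=t)
        print(t, (x, y), collides(robot, (x, y)))
```

A real evaluation would also need static-obstacle collisions from the physics engine; the sketch covers only the human-obstacle case.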

AdaVLN: Towards Visual Language Navigation in Continuous Indoor Environments with Moving Humans
