Abstract: Vision-and-Language Navigation (VLN), a widely discussed research direction in embodied intelligence, aims to enable embodied agents to navigate complicated visual environments by following natural language commands. Most existing VLN methods focus on indoor ground-robot scenarios. However, when applied to UAV VLN in outdoor urban scenes, they face two significant challenges. First, urban scenes contain numerous objects, which makes it difficult to match fine-grained landmarks in images with the complex textual descriptions of those landmarks. Second, the overall environmental information spans multiple modalities, and the diversity of representations significantly increases the complexity of the encoding process. To address these challenges, we propose NavAgent, the first urban UAV embodied navigation model driven by a large Vision-Language Model. NavAgent undertakes navigation tasks by synthesizing multi-scale environmental information, including topological maps (global), panoramas (medium), and fine-grained landmarks (local). Specifically, we utilize GLIP to build a landmark visual recognizer capable of identifying fine-grained landmarks and describing them in text. Subsequently, we develop a dynamically growing scene topology map that integrates environmental information, and we employ Graph Convolutional Networks to encode the global environmental data. In addition, to train the landmark visual recognizer, we develop NavAgent-Landmark2K, the first fine-grained landmark dataset for real urban street scenes. In experiments conducted on the Touchdown and Map2seq datasets, NavAgent outperforms strong baseline models. The code and dataset will be released to the community to facilitate the exploration and development of outdoor VLN.
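The idea of a dynamically growing topology map encoded by a graph convolution can be illustrated with a minimal sketch. The abstract does not specify NavAgent's actual architecture, so everything below is an assumption made for illustration: the `TopoMap` class, the `add_node`, `normalized_adjacency`, `gcn_layer`, and `encode` names, the feature dimensions, and the choice of a single layer with the common symmetric normalization D^{-1/2}(A + I)D^{-1/2}.

```python
import numpy as np


class TopoMap:
    """Hypothetical growing scene graph; each node is a visited viewpoint."""

    def __init__(self, feat_dim):
        self.feat_dim = feat_dim
        self.feats = []      # one feature vector per node
        self.edges = set()   # undirected edges stored as (older, newer)

    def add_node(self, feat, neighbors=()):
        """Append a node and connect it to already-existing nodes."""
        feat = np.asarray(feat, dtype=np.float64)
        assert feat.shape == (self.feat_dim,)
        idx = len(self.feats)
        self.feats.append(feat)
        for n in neighbors:
            assert 0 <= n < idx, "neighbors must already be in the map"
            self.edges.add((n, idx))
        return idx

    def normalized_adjacency(self):
        """Return D^{-1/2} (A + I) D^{-1/2} for the current graph."""
        n = len(self.feats)
        a = np.eye(n)  # self-loops
        for i, j in self.edges:
            a[i, j] = a[j, i] = 1.0
        d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
        return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_hat, x, w):
    """One graph-convolution layer: ReLU(A_hat @ X @ W)."""
    return np.maximum(a_hat @ x @ w, 0.0)


def encode(topo, w):
    """Encode every node of the current map with one GCN layer."""
    x = np.stack(topo.feats)
    return gcn_layer(topo.normalized_adjacency(), x, w)


# The map grows by one node per navigation step and is re-encoded each time.
rng = np.random.default_rng(0)
topo = TopoMap(feat_dim=4)
w = rng.normal(size=(4, 8))  # random weights stand in for learned ones
prev = None
for _ in range(3):
    prev = topo.add_node(rng.normal(size=4),
                         neighbors=[] if prev is None else [prev])
    h = encode(topo, w)      # shape: (number of nodes so far, 8)
```

Re-encoding the whole graph at every step keeps the sketch simple; a real system would likely use learned weights, several layers, and richer node features such as panorama or landmark embeddings.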

Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial Environmental Representations for Vision-and-Language Navigation

Unifying Terrain Awareness Through Real-Time Semantic Segmentation

Seeing is Believing? Enhancing Vision-Language Navigation using Visual Perturbations

UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

Learning Semantic-Agnostic and Spatial-Aware Representation for Generalizable Visual-Audio Navigation

Vision-and-Language Navigation via Latent Semantic Alignment Learning

Volumetric Environment Representation for Vision-Language Navigation

NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation

Active Visual Information Gathering for Vision-Language Navigation

Active Perception for Visual-Language Navigation

Learning Navigational Visual Representations with Semantic Map Supervision

A Dual Semantic-Aware Recurrent Global-Adaptive Network For Vision-and-Language Navigation

Sim-to-Real Transfer via 3D Feature Fields for Vision-and-Language Navigation

Vision and Language Navigation in the Real World via Online Visual Language Mapping

Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation

Think Holistically, Act Down-to-Earth: A Semantic Navigation Strategy with Continuous Environmental Representation and Multi-step Forward Planning

Self-Supervised 3-D Semantic Representation Learning for Vision-and-Language Navigation

Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding

Demo Abstract: Embodied Aerial Agent for City-level Visual Language Navigation Using Large Language Model