Abstract: Vision-and-language navigation (VLN) aims to guide autonomous agents through real-world environments by integrating visual and linguistic cues. Despite notable advancements in ground-level navigation, the exploration of aerial navigation using these modalities remains limited. This gap primarily arises from a lack of suitable resources for real-world, city-scale aerial navigation studies. To remedy this gap, we introduce CityNav, a novel dataset explicitly designed for language-guided aerial navigation in photorealistic 3D environments of real cities. CityNav comprises 32k natural language descriptions paired with human demonstration trajectories, collected via a newly developed web-based 3D simulator. Each description identifies a navigation goal, utilizing the names and locations of landmarks within actual cities. As an initial step toward addressing this challenge, we provide baseline models of navigation agents that incorporate an internal 2D spatial map representing landmarks referenced in the descriptions. We have benchmarked the latest aerial navigation methods alongside our proposed baseline model on the CityNav dataset. The findings are revealing: (i) our aerial agent models trained on human demonstration trajectories outperform those trained on shortest-path trajectories by a large margin; (ii) incorporating 2D spatial map information markedly and robustly enhances navigation performance at a city scale; (iii) despite the use of map information, our challenging CityNav dataset reveals a persistent performance gap between our baseline models and human performance. To foster further research in aerial VLN, we have made the dataset and code available at https://water-cookie.github.io/city-nav-proj/.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses city-scale, language-guided aerial navigation. Specifically, the researchers study how visual and language cues can guide unmanned aerial vehicles (UAVs) to navigate efficiently in 3D environments of real cities.

### Background and Challenges

1. **Limitations of Existing Datasets**:
   - Despite significant progress in ground-level vision-and-language navigation (VLN), research on aerial navigation is relatively scarce.
   - Existing aerial navigation datasets are usually based on virtual environments or satellite images. They lack the complexity and geographic information of the real world, which limits their usefulness in practical applications.
2. **Unique Challenges**:
   - Aerial navigation involves vast 3D spaces, uncertain paths, and the need to handle complex geographic features.
   - In situations such as natural disasters or unreliable GNSS signals, traditional algorithmic route-planning methods may fail.

### Solutions

1. **CityNav Dataset**:
   - The researchers introduce the CityNav dataset, designed specifically for city-scale, language-guided aerial navigation.
   - The dataset contains roughly 32,000 natural language descriptions with corresponding human demonstration trajectories, collected on 3D point cloud data of real cities.
   - Each description specifies a navigation goal using the names and locations of landmarks in actual cities.
2. **Baseline Models**:
   - The researchers provide baseline navigation agents that maintain an internal 2D spatial map representing the landmarks mentioned in the descriptions.
   - Benchmarking these baselines on the CityNav dataset showed that:
     - Aerial agent models trained on human demonstration trajectories outperformed those trained on shortest-path trajectories by a large margin.
     - Adding 2D spatial map information markedly improved city-scale navigation performance.
     - Even with map information, the baseline models still lagged behind human performance.

### Main Contributions

1. **Development of a New 3D Flight Simulator**:
   - The simulator runs in a web browser and integrates with Amazon Mechanical Turk (MTurk) for large-scale collection of human-generated flight trajectories.
2. **Introduction of the CityNav Dataset**:
   - The dataset contains 32,637 language goal descriptions with corresponding demonstration trajectories, built on 3D scan data and geographic information of real cities.
3. **Provision of Map-Based Baseline Models**:
   - These models include an internal 2D spatial map that represents geographic information, which helps cope with the extensive search space of city-scale navigation.
4. **Demonstration of the Advantages of Combining Human Strategies and Geographic Information**:
   - Combining human-driven strategies with geographic information significantly improved city-scale aerial navigation performance under both normal and challenging conditions.

Through these contributions, the researchers provide resources and a foundation for future aerial VLN research.
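The summary does not say how the baseline's internal 2D spatial map is encoded, so the following is only an illustrative sketch of the general idea: landmarks named in a description are rasterized onto a top-down grid that an agent could take as an extra input channel. The `Landmark` class, the `rasterize_landmarks` function, the axis-aligned footprints, the substring name matching, and the example landmark names are all assumptions made for illustration; none of them is taken from the CityNav code.

```python
from dataclasses import dataclass


@dataclass
class Landmark:
    name: str
    # Hypothetical axis-aligned footprint in world metres:
    # (x_min, y_min, x_max, y_max).
    bounds: tuple


def rasterize_landmarks(description, landmarks, origin, cell_size, grid_shape):
    """Mark grid cells covered by landmarks whose names appear in the description.

    Returns a rows x cols list of lists holding 0 (empty) or 1 (landmark).
    Name matching is a plain case-insensitive substring test, which is a
    simplifying assumption rather than the paper's method.
    """
    rows, cols = grid_shape
    grid = [[0] * cols for _ in range(rows)]
    text = description.lower()
    for lm in landmarks:
        if lm.name.lower() not in text:
            continue  # landmark is not referenced by this description
        x_min, y_min, x_max, y_max = lm.bounds
        # Convert world coordinates to cell indices, clamped to the grid.
        c0 = max(0, int((x_min - origin[0]) // cell_size))
        c1 = min(cols - 1, int((x_max - origin[0]) // cell_size))
        r0 = max(0, int((y_min - origin[1]) // cell_size))
        r1 = min(rows - 1, int((y_max - origin[1]) // cell_size))
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                grid[r][c] = 1
    return grid


# Hypothetical example: a 50 m x 50 m area split into 10 m cells.
landmarks = [
    Landmark("Central Library", (10, 10, 29, 19)),
    Landmark("Market Square", (40, 40, 49, 49)),
]
grid = rasterize_landmarks(
    "the red car parked next to Central Library",
    landmarks,
    origin=(0, 0),
    cell_size=10,
    grid_shape=(5, 5),
)
for row in grid:
    print(row)
```

Only the landmark mentioned in the description is drawn, so the map narrows the search area without revealing the goal itself. A real system would more likely use polygon footprints from geographic data and a more robust way of linking description text to landmark names.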

CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information
