CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information

Jungdae Lee, Taiki Miyanishi, Shuhei Kurita, Koya Sakamoto, Daichi Azuma, Yutaka Matsuo, Nakamasa Inoue
2024-10-06
Abstract: Vision-and-language navigation (VLN) aims to guide autonomous agents through real-world environments by integrating visual and linguistic cues. Despite notable advancements in ground-level navigation, the exploration of aerial navigation using these modalities remains limited. This gap primarily arises from a lack of suitable resources for real-world, city-scale aerial navigation studies. To remedy this gap, we introduce CityNav, a novel dataset explicitly designed for language-guided aerial navigation in photorealistic 3D environments of real cities. CityNav comprises 32k natural language descriptions paired with human demonstration trajectories, collected via a newly developed web-based 3D simulator. Each description identifies a navigation goal, utilizing the names and locations of landmarks within actual cities. As an initial step toward addressing this challenge, we provide baseline models of navigation agents that incorporate an internal 2D spatial map representing landmarks referenced in the descriptions. We have benchmarked the latest aerial navigation methods alongside our proposed baseline model on the CityNav dataset. The findings are revealing: (i) our aerial agent models trained on human demonstration trajectories outperform those trained on shortest path trajectories by a large margin; (ii) incorporating 2D spatial map information markedly and robustly enhances navigation performance at a city scale; (iii) despite the use of map information, our challenging CityNav dataset reveals a persistent performance gap between our baseline models and human performance. To foster further research in aerial VLN, we have made the dataset and code available at https://water-cookie.github.io/city-nav-proj/.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses city-scale, language-guided aerial navigation. Specifically, the researchers study how visual and language cues can guide unmanned aerial vehicles (UAVs) through 3D environments of real cities.

### Background and Challenges

1. **Limitations of Existing Datasets**:
   - Despite significant progress in ground-level vision-and-language navigation (VLN), research on aerial navigation remains scarce.
   - Existing aerial navigation datasets are usually based on virtual environments or satellite images. They lack the complexity and geographic information of the real world, which limits their usefulness in practical applications.
2. **Unique Challenges**:
   - Aerial navigation involves vast 3D spaces, uncertain paths, and complex geographic features.
   - In situations such as natural disasters or unreliable GNSS signals, traditional algorithmic route planning may fail.

### Solutions

1. **CityNav Dataset**:
   - The researchers introduce the CityNav dataset, designed specifically for city-scale language-guided aerial navigation.
   - The dataset contains about 32k natural language descriptions paired with human demonstration trajectories, collected on 3D point cloud data of real cities.
   - Each description specifies a navigation goal using the names and locations of landmarks in actual cities.
2. **Baseline Models**:
   - The researchers provide baseline navigation agents built around an internal 2D spatial map that represents the landmarks mentioned in the descriptions.
   - Benchmarking these baselines on the CityNav dataset shows that:
     - Aerial agent models trained on human demonstration trajectories outperform those trained on shortest path trajectories by a large margin.
     - Adding 2D spatial map information markedly improves city-scale navigation performance.
     - Even with map information, the baseline models still lag behind human performance.

### Main Contributions

1. **Development of a New 3D Flight Simulator**:
   - The simulator runs in a web browser and integrates with Amazon Mechanical Turk (MTurk) for large-scale collection of human-generated flight trajectories.
2. **Introduction of the CityNav Dataset**:
   - The dataset contains 32,637 language goal descriptions with corresponding demonstration trajectories, built on 3D scan data and geographic information of real cities.
3. **Provision of Map-Based Baseline Models**:
   - These models include an internal 2D spatial map that represents geographic information, which helps with the extensive search space of city-scale navigation.
4. **Demonstration of the Advantages of Combining Human Strategies and Geographic Information**:
   - Combining human-driven strategies with geographic information significantly improves city-scale aerial navigation performance under both normal and challenging conditions.

Through these contributions, the researchers provide resources and a foundation for future aerial VLN research.
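
To make the idea of an internal 2D spatial map more concrete, the following is a minimal sketch of how landmarks named in a description might be rasterized onto a top-down grid. It is an illustration only, not the authors' implementation: the `Landmark` class, the `rasterize_landmarks` function, the rectangular footprints, the 1 km extent, and the 10 m cell size are all assumptions introduced here, as is the simple substring matching used to decide which landmarks a description mentions.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Landmark:
    """Hypothetical landmark record: a name plus an axis-aligned footprint in metres."""
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def rasterize_landmarks(landmarks, description, extent_m=1000.0, cell_m=10.0):
    """Return a binary grid marking cells covered by landmarks named in the description.

    The grid covers a square of side `extent_m` metres with `cell_m`-metre cells.
    Rows index y and columns index x (both assumptions of this sketch).
    """
    size = int(round(extent_m / cell_m))
    grid = np.zeros((size, size), dtype=np.uint8)
    text = description.lower()
    for lm in landmarks:
        # Skip landmarks whose name does not occur in the description.
        if lm.name.lower() not in text:
            continue
        # Convert metric bounds to cell indices, clipped to the grid.
        c0 = max(int(lm.x_min // cell_m), 0)
        c1 = min(int(np.ceil(lm.x_max / cell_m)), size)
        r0 = max(int(lm.y_min // cell_m), 0)
        r1 = min(int(np.ceil(lm.y_max / cell_m)), size)
        grid[r0:r1, c0:c1] = 1
    return grid


# Usage with made-up landmarks and a made-up description.
landmarks = [
    Landmark("Central Park", 100, 200, 150, 260),
    Landmark("Old Mill", 500, 500, 600, 600),
]
description = "The red car parked next to Central Park"
grid = rasterize_landmarks(landmarks, description)
print(grid.shape, int(grid.sum()))
```

A grid like this could be stacked with other channels, such as the agent's current view area, and fed to a learned policy. How the published baselines actually encode and consume the map is not specified in this summary.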