Abstract: Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets and the stringent precision requirements for aligning visual and textual data. To address this pressing need, we introduce GeoText-1652, a new natural language-guided geo-localization benchmark. This dataset is systematically constructed through an interactive human-computer process leveraging Large Language Model (LLM) driven annotation techniques in conjunction with pre-trained vision models. GeoText-1652 extends the established University-1652 image dataset with spatial-aware text annotations, thereby establishing one-to-one correspondences between image, text, and bounding box elements. We further introduce a new optimization objective to leverage fine-grained spatial associations, called blending spatial matching, for region-level spatial relation matching. Extensive experiments reveal that our approach maintains a competitive recall rate compared with other prevailing cross-modality methods. This underscores the promising potential of our approach in elevating drone control and navigation through the seamless integration of natural language commands in real-world scenarios.
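To make the image, text, and bounding box correspondence more concrete, here is a minimal Python sketch of one such triple, together with a coarse relative-position check between two regions. Everything in it is an assumption made for illustration: the `RegionAnnotation` type, its field names, the pixel box format, and the four-word relation vocabulary are hypothetical. It is not the dataset's actual schema, nor the blending spatial matching objective itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionAnnotation:
    """One hypothetical image-text-box triple (field names are illustrative)."""
    image_id: str
    phrase: str
    box: tuple  # assumed format: (x_min, y_min, x_max, y_max) in pixels


def box_center(box):
    """Return the (x, y) center of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def coarse_relation(anchor, other):
    """Describe where `other` lies relative to `anchor`.

    Uses image coordinates, where y grows downward. The dominant axis of the
    center offset decides between a horizontal and a vertical relation.
    """
    ax, ay = box_center(anchor.box)
    ox, oy = box_center(other.box)
    dx, dy = ox - ax, oy - ay
    if dx == 0 and dy == 0:
        return "overlapping"  # identical centers: no direction can be derived
    if abs(dx) >= abs(dy):
        return "right of" if dx > 0 else "left of"
    return "below" if dy > 0 else "above"


building = RegionAnnotation("img_0", "a red-roofed building", (10, 10, 50, 50))
parking = RegionAnnotation("img_0", "a parking lot", (100, 20, 140, 60))
print(f"{parking.phrase} is {coarse_relation(building, parking)} {building.phrase}")
```

A rule like this only captures the simplest kind of spatial cue that a region-level annotation could encode. A learned matching objective would operate on features rather than on hand-written thresholds.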

What problem does this paper attempt to address?

The paper addresses natural language-guided drone navigation, specifically the geo-localization of drones from natural language commands. It tackles two key issues:

1. **Lack of large public language-guided datasets**: No large-scale public dataset provides detailed text descriptions of images, which significantly limits research on natural language-based drone navigation. Building such a dataset is costly in human labor and requires high-quality, reliable annotations.
2. **Difficulty in aligning language and visual representations**: Drone-view scene images are rich in detail, so precisely aligning natural language descriptions with the corresponding visual information remains challenging.

To address these issues, the paper makes the following contributions:

- **GeoText-1652 Dataset**: The authors constructed a new benchmark dataset named GeoText-1652, built upon the existing University-1652 image dataset. It includes rich text-bounding box pairings, establishing a one-to-one correspondence between images, text, and bounding box elements. These pairings were obtained through a human-machine interactive annotation process.
- **Spatial Relationship Matching Method**: A new spatial-aware method performs region-level spatial relationship matching. It considers the relative positions between objects and also uses the textual descriptions of surrounding locations to achieve more precise localization.
- **Experimental Results**: Experiments show that the dataset helps in learning viewpoint-invariant features, thereby improving the accuracy and intuitiveness of language-based drone control. The proposed model achieved a recall@10 of 31.2% with text queries, surpassing some existing models, and generalized well to unseen real-world scenarios.

In summary, by introducing the GeoText-1652 dataset and a new spatial relationship matching method, the paper provides effective solutions and technical support for natural language-guided drone navigation.
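The recall figure quoted above is a retrieval metric: the fraction of text queries whose correct image appears among the top K ranked results. A minimal pure-Python sketch of that computation follows. It assumes a precomputed query-by-gallery similarity matrix and exactly one correct gallery index per query; the function name and inputs are hypothetical, and the paper's own evaluation protocol may differ, for instance when several gallery images depict the correct location.

```python
def recall_at_k(similarity, gt_index, k):
    """Fraction of queries whose ground-truth item ranks in the top k.

    similarity: list of rows, one per query, each holding a score per gallery item
    gt_index:   list with the index of the correct gallery item for each query
    """
    hits = 0
    for scores, gt in zip(similarity, gt_index):
        # Rank gallery indices from highest to lowest similarity.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if gt in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy example: 3 text queries scored against 3 gallery images.
similarity = [
    [0.9, 0.1, 0.3],  # correct item 0 is ranked first
    [0.2, 0.8, 0.5],  # correct item 2 is ranked second
    [0.6, 0.7, 0.1],  # correct item 0 is ranked second
]
gt_index = [0, 2, 0]
print(recall_at_k(similarity, gt_index, 1))  # one of three queries hits at k=1
print(recall_at_k(similarity, gt_index, 2))  # all three queries hit at k=2
```

Because recall@K grows with K, a value reported at K=10 is not directly comparable with values reported at K=1 or K=5.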

Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching

Collaborative Localization of Aerial and Ground Mobile Robots Through Orthomosaic Map

Geo-Localization with Transformer-Based 2D-3D Match Network

Game4Loc: A UAV Geo-Localization Benchmark from Game Data

University-1652: A Multi-view Multi-source Benchmark for Drone-based Geo-localization

Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning

UAV's Status Is Worth Considering: A Fusion Representations Matching Method for Geo-Localization

Geo-Localization via Ground-to-Satellite Cross-View Image Retrieval

GeoGLUE: A GeoGraphic Language Understanding Evaluation Benchmark

A Benchmark for UAV-View Natural Language-Guided Tracking

Monocular-GPS Fusion 3D Object Detection for UAVs

GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

A Novel Geo-Localization Method for UAV and Satellite Images Using Cross-View Consistent Attention

Open 3D World in Autonomous Driving

Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology

AerialVLN: Vision-and-Language Navigation for UAVs

UAV Geo-Localization Dataset and Method Based on Cross-View Matching

Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study

SpatialBot: Precise Spatial Understanding with Vision Language Models

Vision Meets Drones: A Challenge