Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching

Meng Chu, Zhedong Zheng, Wei Ji, Tingyu Wang, Tat-Seng Chua
2024-07-31
Abstract: Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets and the stringent precision requirements for aligning visual and textual data. To address this pressing need, we introduce GeoText-1652, a new natural language-guided geo-localization benchmark. This dataset is systematically constructed through an interactive human-computer process leveraging Large Language Model (LLM) driven annotation techniques in conjunction with pre-trained vision models. GeoText-1652 extends the established University-1652 image dataset with spatial-aware text annotations, thereby establishing one-to-one correspondences between image, text, and bounding box elements. We further introduce a new optimization objective to leverage fine-grained spatial associations, called blending spatial matching, for region-level spatial relation matching. Extensive experiments reveal that our approach maintains a competitive recall rate compared with other prevailing cross-modality methods. This underscores the promising potential of our approach in elevating drone control and navigation through the seamless integration of natural language commands in real-world scenarios.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
The paper addresses drone navigation guided by natural language, specifically the geo-localization of drones through natural language commands. It tackles two key issues:

1. **Lack of large public language-guided datasets**: No large-scale public dataset currently provides detailed descriptions of images, which significantly limits research on natural language-based drone navigation. Building such a dataset is costly in human effort and requires high-quality, reliable annotations.
2. **Difficulty in aligning language and visual representations**: Because drone-view scene images are rich in detail, precisely aligning natural language descriptions with the corresponding visual information remains challenging.

To address these issues, the paper makes the following contributions:

- **GeoText-1652 Dataset**: The authors constructed a new benchmark dataset named GeoText-1652, built upon the existing University-1652 image dataset. It includes rich text-bounding box pairings, establishing a one-to-one correspondence between images, text, and bounding box elements. These pairings were obtained through a human-machine interactive annotation process.
- **Spatial Relationship Matching Method**: A new spatial-aware method is proposed to perform region-level spatial relationship matching. This method considers the relative positions between objects and also uses the textual descriptions of surrounding locations to achieve more precise localization.
- **Experimental Results**: Experiments show that this dataset helps in learning viewpoint-invariant features, thereby improving the accuracy and intuitiveness of language-based drone control.
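
To make the idea of "relative positions between objects" in the contributions above more concrete, here is a minimal sketch of how a coarse spatial relation could be derived from two bounding boxes. It is only an illustration and not the paper's blending spatial matching objective. The function names, the normalized `(x_min, y_min, x_max, y_max)` box format, the `tol` threshold, and the relation labels are all assumptions introduced for this example.

```python
from typing import Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in normalized image
# coordinates, with y growing downward as in most image libraries.
Box = Tuple[float, float, float, float]


def box_center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def spatial_relation(a: Box, b: Box, tol: float = 0.05) -> str:
    """Describe where box `a` lies relative to box `b`.

    Center offsets smaller than `tol` along an axis are treated as
    "no offset" on that axis (an assumed threshold, not a published value).
    """
    ax, ay = box_center(a)
    bx, by = box_center(b)
    dx, dy = ax - bx, ay - by

    horizontal = "" if abs(dx) <= tol else ("right" if dx > 0 else "left")
    vertical = "" if abs(dy) <= tol else ("below" if dy > 0 else "above")

    if not horizontal and not vertical:
        return "same-position"
    return "-".join(part for part in (vertical, horizontal) if part)


if __name__ == "__main__":
    building = (0.1, 0.1, 0.3, 0.3)
    road = (0.6, 0.6, 0.9, 0.9)
    print(spatial_relation(building, road))
```

A label of this kind could, in principle, be compared against the relation words in a text annotation (for example "to the left of"), which is the sort of region-level consistency the summary describes.
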
The proposed model achieved a Recall@10 of 31.2% with text queries, surpassing some existing models, and also generalized well to unseen real-world scenarios.

In summary, by introducing the GeoText-1652 dataset and a new spatial relationship matching method, this paper provides effective solutions and technical support for drone navigation guided by natural language.
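
For readers unfamiliar with the recall figure quoted above, the sketch below shows one common way Recall@K is computed in retrieval: a query counts as a hit if at least one correct item appears among its top K results. This is a generic illustration under that assumed definition, not the paper's evaluation code. The function name `recall_at_k` and the toy query and image identifiers are hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of queries with at least one relevant item in the top k.

    rankings: query id -> candidate ids sorted from best to worst match.
    relevant: query id -> set of candidate ids that are correct matches.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if not rankings:
        raise ValueError("rankings must contain at least one query")

    hits = 0
    for query_id, ranked in rankings.items():
        correct = relevant.get(query_id, set())
        # A query is a hit if any of its top-k candidates is a correct match.
        if any(candidate in correct for candidate in ranked[:k]):
            hits += 1
    return hits / len(rankings)


if __name__ == "__main__":
    # Toy data: two text queries, each ranking three candidate images.
    rankings = {
        "text_1": ["img_a", "img_b", "img_c"],
        "text_2": ["img_d", "img_e", "img_f"],
    }
    relevant = {"text_1": {"img_b"}, "text_2": {"img_f"}}
    for k in (1, 2, 3):
        print(k, recall_at_k(rankings, relevant, k))
```

Under this definition, a Recall@10 of 31.2% would mean that for roughly 31 out of every 100 text queries, a correct image appears somewhere in the first ten retrieved results.
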