Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation

Yifei Su, Dong An, Yuan Xu, Kehan Chen, Yan Huang
2023-12-14
Abstract: This report details the methods of the winning entry of the AVDN Challenge in ICCV CLVL 2023. The competition addresses the Aerial Navigation from Dialog History (ANDH) task, which requires a drone agent to associate dialog history with aerial observations to reach the destination. For better cross-modal grounding abilities of the drone agent, we propose a Target-Grounded Graph-Aware Transformer (TG-GAT) framework. Concretely, TG-GAT first leverages a graph-aware transformer to capture spatiotemporal dependency, which benefits navigation state tracking and robust action planning. In addition, an auxiliary visual grounding task is devised to boost the agent's awareness of referred landmarks. Moreover, a hybrid augmentation strategy based on large language models is utilized to mitigate data scarcity limitations. Our TG-GAT framework won the AVDN Challenge, with 2.2% and 3.0% absolute improvements over the baseline on SPL and SR metrics, respectively. The code is available at https://github.com/yifeisu/TG-GAT.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the cross-modal association problem in the Aerial Navigation from Dialog History (ANDH) task, in which a drone agent must relate dialog history to aerial observations in order to reach its destination. It proposes a framework called Target-Grounded Graph-Aware Transformer (TG-GAT), which aims to improve the agent's understanding of the dialog history and its ability to locate the landmarks mentioned in it while navigating. The framework rests on three main components:

1. **Graph-Aware Transformer**: Combines the dialog with structured historical visual observations through a graph-aware attention mechanism, giving action planning access to spatial and temporal context.
2. **Auxiliary Grounding Task**: Adds a fine-grained visual grounding objective that requires the model to predict the bounding box of the referred landmark, which strengthens its ability to recognize mentioned landmarks.
3. **Hybrid Data Augmenter**: Mitigates the scarcity of training data by rewriting human instructions with large language models and applying several augmentations to the images.

With these components, TG-GAT won the AVDN Challenge at ICCV CLVL 2023, with absolute improvements over the baseline of 3.0% in Success Rate (SR) and 2.2% in Success weighted by inverse Path Length (SPL). These results suggest that TG-GAT helps with the main difficulties of the ANDH task: tracking the navigation state over long trajectories and wide fields of view, and localizing landmarks accurately.
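To make the two reported metrics concrete, the sketch below computes SR and SPL using the standard definition of SPL: the success indicator weighted by the ratio of the shortest path length to the longer of the agent's path and the shortest path, averaged over episodes. It is only an illustration and not the challenge's evaluation code. The `Episode` class, its field names, and the sample values are hypothetical, and the success criterion used by the AVDN benchmark is assumed to be decided elsewhere and passed in as a boolean.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One evaluation episode (hypothetical container, not from the paper)."""
    success: bool         # whether the benchmark's success criterion was met
    shortest_path: float  # length of the reference (shortest) path
    agent_path: float     # length of the path the agent actually flew


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("at least one episode is required")
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by (inverse) path length, averaged over episodes."""
    if not episodes:
        raise ValueError("at least one episode is required")
    total = 0.0
    for e in episodes:
        longest = max(e.agent_path, e.shortest_path)
        # Failed episodes contribute 0; a zero-length reference path is skipped
        # to avoid dividing by zero.
        if e.success and longest > 0:
            total += e.shortest_path / longest
    return total / len(episodes)


# Made-up sample episodes, for illustration only.
episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=10.0),   # optimal path
    Episode(success=True, shortest_path=10.0, agent_path=20.0),   # detour halves the credit
    Episode(success=False, shortest_path=10.0, agent_path=5.0),   # failure scores zero
    Episode(success=True, shortest_path=8.0, agent_path=8.0),     # optimal path
]

print(f"SR  = {success_rate(episodes):.3f}")
print(f"SPL = {spl(episodes):.3f}")
```

Because SPL discounts each success by path efficiency, it can never exceed SR on the same set of episodes, which is consistent with the SPL gain reported above being smaller than the SR gain.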