Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation

Yifei Su, Dong An, Yuan Xu, Kehan Chen, Yan Huang

2023-12-14

Abstract: This report details the methods of the winning entry of the AVDN Challenge in ICCV CLVL 2023. The competition addresses the Aerial Navigation from Dialog History (ANDH) task, which requires a drone agent to associate dialog history with aerial observations to reach the destination. To improve the drone agent's cross-modal grounding ability, we propose a Target-Grounded Graph-Aware Transformer (TG-GAT) framework. Concretely, TG-GAT first leverages a graph-aware transformer to capture spatiotemporal dependencies, which benefits navigation state tracking and robust action planning. In addition, an auxiliary visual grounding task is devised to boost the agent's awareness of referred landmarks. Moreover, a hybrid augmentation strategy based on large language models is utilized to mitigate data scarcity. Our TG-GAT framework won the AVDN Challenge, with 2.2% and 3.0% absolute improvements over the baseline on SPL and SR metrics, respectively. The code is available at https://github.com/yifeisu/TG-GAT.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses cross-modal grounding in the Aerial Navigation from Dialog History (ANDH) task. It proposes the Target-Grounded Graph-Aware Transformer (TG-GAT), a framework intended to improve a UAV agent's understanding of the dialog history and its ability to localize referred landmarks while navigating. The framework has three main components:

1. **Graph-Aware Transformer**: Uses a graph attention mechanism to combine the dialog with structured historical visual observations, giving action planning more complete spatial and temporal information.
2. **Auxiliary Grounding Task**: Adds a fine-grained visual grounding task in which the model must predict the bounding box of the specified landmark, strengthening its ability to recognize the landmarks mentioned in the dialog.
3. **Hybrid Data Augmenter**: Uses large language models to rewrite human instructions and applies various augmentations to images, alleviating the shortage of training data.

With these components, TG-GAT won the AVDN Challenge at ICCV CLVL 2023, with absolute improvements over the baseline of 3.0% in Success Rate (SR) and 2.2% in Success weighted by inverse Path Length (SPL). These results suggest that TG-GAT helps with two challenges of the ANDH task: tracking the navigation state over long trajectories and wide fields of view, and localizing landmarks accurately.
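To make the two reported metrics concrete, the sketch below computes SR and SPL over a set of episodes. It assumes the standard SPL definition from the embodied-navigation literature (success weighted by the ratio of the shortest path length to the longer of the agent's path and the shortest path). The names `Episode`, `success_rate`, and `spl` are hypothetical, and the success flag is taken as given input, since the challenge's success criterion is not described here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One navigation episode (hypothetical container for illustration)."""
    success: bool          # whether the agent reached the destination
    shortest_path: float   # length of the reference (shortest) path
    agent_path: float      # length of the path the agent actually flew


def success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes that ended in success."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by (normalized inverse) path length."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for e in episodes:
        if not e.success:
            continue  # failed episodes contribute zero
        denom = max(e.agent_path, e.shortest_path)
        # Guard the degenerate case where start and goal coincide.
        total += 1.0 if denom == 0 else e.shortest_path / denom
    return total / len(episodes)


episodes = [
    Episode(success=True, shortest_path=10.0, agent_path=10.0),   # optimal
    Episode(success=True, shortest_path=10.0, agent_path=20.0),   # detour
    Episode(success=False, shortest_path=10.0, agent_path=5.0),   # failed
    Episode(success=True, shortest_path=10.0, agent_path=8.0),    # ratio capped at 1
]
print(f"SR  = {success_rate(episodes):.3f}")
print(f"SPL = {spl(episodes):.3f}")
```

Because SPL discounts each success by path efficiency, it can never exceed SR on the same episodes, which is consistent with the smaller SPL gain reported above.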

Aerial Vision-and-Dialog Navigation

Multi-model fusion for Aerial Vision and Dialog Navigation based on human attention aids

TransNav: spatial sequential transformer network for visual navigation

Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

Target-Driven Structured Transformer Planner for Vision-Language Navigation

Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning

Goal-Guided Transformer-Enabled Reinforcement Learning for Efficient Autonomous Navigation

SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation

NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation

AerialVLN: Vision-and-Language Navigation for UAVs

GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models

ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments

Decoupled Spatial Temporal Graphs for Generic Visual Grounding

SGDViT: Saliency-Guided Dynamic Vision Transformer for UAV Tracking

Learning multimodal adaptive relation graph and action boost memory for visual navigation

A Global-Memory-Aware Transformer for Vision-and-Language Navigation

Vision-and-Language Navigation Generative Pretrained Transformer