Multi-model fusion for Aerial Vision and Dialog Navigation based on human attention aids

Xinyi Wang, Xuan Cui, Danxu Li, Fang Liu, Licheng Jiao

2023-08-27

Abstract: Drones are widely used in many areas of daily life. Controlling them through dialog relieves people of the burden of holding a controller all the time and makes drone control easier for people with disabilities or occupied hands. However, controlling aerial robots is more complicated than controlling other robots due to factors such as uncontrollable height. It is therefore crucial to develop an intelligent UAV that can talk to humans and follow natural language commands. In this report, we present an aerial navigation task for the 2023 ICCV Conversation History. Based on the AVDN dataset, which contains more than 3k recorded navigation trajectories and asynchronous human-robot conversations, we propose an effective fusion training method that combines a Human Attention Aided Transformer (HAA-Transformer) model and a Human Attention Aided LSTM (HAA-LSTM) model to predict navigation waypoints and human attention. The method not only achieves high SR and SPL scores, but also shows a 7% improvement in the GP metric compared to the baseline model.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the difficulty of controlling drones in everyday applications, particularly when the user cannot conveniently hold a controller (for example, because of a disability or because both hands are occupied), and asks how a drone can instead be navigated through dialogue. Specifically, the research goal is to develop an intelligent drone system that can communicate with humans in natural language and navigate according to dialogue instructions. To this end, the authors propose a fusion training method built on the AVDN dataset, which contains over 3,000 recorded navigation trajectories together with asynchronous human-robot dialogues. By combining the Human Attention Aided Transformer model (HAA-Transformer) and the Human Attention Aided LSTM model (HAA-LSTM), the method achieves high Success Rate (SR) and Success weighted by Path Length (SPL) scores and improves the Goal Progress (GP) metric by 7% over the baseline model. The research also examines how the number of training iterations affects model performance and reports that multi-model fusion outperforms single-model training. In summary, the main contribution of the paper is a method that uses deep learning to process multimodal information, enabling drones to perform navigation tasks in complex outdoor environments more effectively and improving the user experience through natural dialogue.
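
The three metrics named above can be illustrated with a minimal Python sketch. It assumes the definitions commonly used in vision-and-language navigation: SR is the fraction of successful episodes, SPL weights each success by the ratio of shortest path length to actual path length, and GP is the average reduction in distance to the goal. The `Episode` fields and function names are hypothetical, the success flag is taken as given, and the AVDN benchmark's exact success criterion and GP definition may differ from this simplified version.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    """One navigation episode (hypothetical record layout, for illustration)."""
    success: bool         # whether the episode met the success criterion
    shortest_path: float  # length of the reference (shortest) path
    agent_path: float     # length of the path the agent actually flew
    start_dist: float     # distance to the goal at the start
    final_dist: float     # distance to the goal at the end


def success_rate(episodes: List[Episode]) -> float:
    """SR: fraction of episodes that succeeded."""
    return sum(1 for e in episodes if e.success) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """SPL: each success is weighted by shortest / max(actual, shortest)."""
    total = 0.0
    for e in episodes:
        denom = max(e.agent_path, e.shortest_path)
        if e.success and denom > 0:
            total += e.shortest_path / denom
    return total / len(episodes)


def goal_progress(episodes: List[Episode]) -> float:
    """GP (simplified assumption): mean reduction in distance to the goal."""
    return sum(e.start_dist - e.final_dist for e in episodes) / len(episodes)


if __name__ == "__main__":
    demo = [
        Episode(True, 10.0, 20.0, 10.0, 1.0),   # success, but path twice as long
        Episode(False, 10.0, 5.0, 10.0, 8.0),   # failure, small progress
    ]
    print(success_rate(demo), spl(demo), goal_progress(demo))
```

Under these definitions a failed episode contributes nothing to SR or SPL but can still raise GP, which is why a method can improve GP without a matching change in the other two metrics.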
