Abstract: Aerospace embodied intelligence aims to empower unmanned aerial vehicles (UAVs) and other aerospace platforms to achieve autonomous perception, cognition, and action, as well as egocentric active interaction with humans and the environment. The aerospace embodied world model serves as an effective means to realize the autonomous intelligence of UAVs and represents a necessary pathway toward aerospace embodied intelligence. However, existing embodied world models primarily focus on ground-level intelligent agents in indoor scenarios, while UAV intelligent agents remain unexplored. To address this gap, we construct the first large-scale real-world image-text pre-training dataset, AerialAgent-Ego10k, featuring urban drones from a first-person perspective. We also create a virtual image-text-pose alignment dataset, CyberAgent-Ego500k, to facilitate the pre-training of the aerospace embodied world model. For the first time, we clearly define 5 downstream tasks, i.e., aerospace embodied scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision, and construct corresponding instruction datasets, i.e., SkyAgent-Scene3k, SkyAgent-Reason3k, SkyAgent-Nav3k, SkyAgent-Plan3k, and SkyAgent-Act3k, for fine-tuning the aerospace embodied world model. Simultaneously, we develop SkyAgentEval, a set of downstream task evaluation metrics based on GPT-4, to comprehensively, flexibly, and objectively assess the results, revealing the potential and limitations of 2D/3D visual language models in UAV-agent tasks. Furthermore, we integrate over 10 2D/3D visual-language models, 2 pre-training datasets, 5 fine-tuning datasets, more than 10 evaluation metrics, and a simulator into the benchmark suite, i.e., AeroVerse, which will be released to the community to promote exploration and development of aerospace embodied intelligence.

What problem does this paper attempt to address?

The paper addresses a gap in embodied-agent research: existing work mainly focuses on indoor scenarios or ground-based agents, while research on UAV agents in the aerospace field is still in its infancy. Specifically, existing embodied world models concentrate on the autonomous perception, cognition, and action of ground-level agents in indoor and outdoor environments. UAV agents, and aerospace embodied world models in particular, have received far less attention, which has led to the following problems:

1. **Lack of Definition of UAV-Embodied Tasks**: Compared with ground-based agents, UAV agents need to understand the internal relationships in four-dimensional space-time and act under scene randomization and local observability. This involves perception, cognition, planning, and decision-making, which makes the definition of downstream tasks complex and unclear.
2. **Difficulty in Obtaining 3D Data**: UAV 3D data is harder to obtain and requires professional equipment and trained operators. Especially in outdoor environments, acquiring large-scale 3D point-cloud data is costly and technically demanding.
3. **High Cost of UAV-Embodied Data Collection**: UAVs have a larger range of motion and more degrees of freedom. They can move freely in three-dimensional space, cover large areas, and traverse complex obstacle environments. Collecting and annotating UAV-embodied data therefore requires a large amount of time and human resources.

To solve these problems, the paper proposes AeroVerse, a benchmark suite for simulating, pre-training, fine-tuning, and evaluating aerospace embodied world models. Specific contributions include:

1. **Constructing a Large-Scale Real-World Image-Text Pre-training Dataset**: The authors develop AerialAgent-Ego10k, the first large-scale real-world pre-training dataset filmed from the first-person perspective of high-altitude UAVs, as well as the virtually aligned pre-training dataset CyberAgent-Ego500k, to enhance the adaptability of UAV agents in real and virtual environments.
2. **For the First Time, Clearly Defining Five Aerospace-Embodied Downstream Tasks**: The tasks are scene perception, spatial reasoning, navigation exploration, task planning, and motion decision-making. The corresponding instruction datasets are SkyAgent-Scene3k, SkyAgent-Reason3k, SkyAgent-Nav3k, SkyAgent-Plan3k, and SkyAgent-Act3k, which support an end-to-end closed loop of perception, cognition, and action.
3. **Developing an Automated Evaluation Method**: The authors develop SkyAgent-Eval, based on GPT-4, to evaluate downstream task results comprehensively, flexibly, and objectively. It provides quantitative scores and explanations, which enhances the credibility of the evaluation results.
4. **Extensive Experimental Verification**: The authors conduct a large number of experiments with ten mainstream baseline models, analyze their performance on the downstream instruction datasets, and reveal the potential and limitations of 2D/3D vision-language models in UAV agent tasks.

Through these efforts, the paper aims to fill the gap in UAV agent research and promote the development of aerospace embodied intelligence.

AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models
