Abstract: Aerospace embodied intelligence aims to empower unmanned aerial vehicles (UAVs) and other aerospace platforms to achieve autonomous perception, cognition, and action, as well as egocentric active interaction with humans and the environment. The aerospace embodied world model serves as an effective means to realize the autonomous intelligence of UAVs and represents a necessary pathway toward aerospace embodied intelligence. However, existing embodied world models primarily focus on ground-level intelligent agents in indoor scenarios, while UAV intelligent agents remain unexplored. To address this gap, we construct the first large-scale real-world image-text pre-training dataset, AerialAgent-Ego10k, featuring urban drones from a first-person perspective. We also create a virtual image-text-pose alignment dataset, CyberAgent-Ego500k, to facilitate the pre-training of the aerospace embodied world model. For the first time, we clearly define five downstream tasks, i.e., aerospace embodied scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision, and construct the corresponding instruction datasets, i.e., SkyAgent-Scene3k, SkyAgent-Reason3k, SkyAgent-Nav3k, SkyAgent-Plan3k, and SkyAgent-Act3k, for fine-tuning the aerospace embodied world model. In addition, we develop SkyAgent-Eval, a set of GPT-4-based evaluation metrics for the downstream tasks, to assess results comprehensively, flexibly, and objectively, revealing the potential and limitations of 2D/3D visual-language models in UAV-agent tasks. Furthermore, we integrate more than 10 2D/3D visual-language models, 2 pre-training datasets, 5 fine-tuning datasets, more than 10 evaluation metrics, and a simulator into a benchmark suite, AeroVerse, which will be released to the community to promote the exploration and development of aerospace embodied intelligence.
What problem does this paper attempt to address?
This paper addresses the lack of research on unmanned aerial vehicle (UAV) agents. Existing embodied world models mainly target the autonomous perception, cognition, and action of ground-level agents, largely in indoor scenarios, while aerospace embodied world models for UAV agents are still in their infancy. The paper attributes this gap to three obstacles:
1. **Lack of Definition of UAV-Embodied Tasks**: Compared with ground-based agents, UAV agents must understand the internal relationships of four-dimensional spacetime and act under scene randomization and partial observability. This involves perception, cognition, planning, and decision-making, which makes the downstream tasks complex and hard to define clearly.
2. **Difficulty in Obtaining 3D Data**: UAV 3D data is harder to obtain because it requires professional equipment and trained operators. In outdoor environments in particular, acquiring large-scale 3D point-cloud data is costly and technically demanding.
3. **High Cost of UAV-Embodied Data Collection**: UAVs have a larger range of motion and more degrees of freedom than ground agents. They move freely in three-dimensional space, cover large areas, and traverse complex obstacle environments, so collecting and annotating UAV-embodied data takes substantial time and human effort.
To address these problems, the paper proposes AeroVerse, a benchmark suite for simulating, pre-training, fine-tuning, and evaluating aerospace embodied world models. Its specific contributions are:
1. **Constructing a Large-Scale Real-World Image-Text Pre-training Dataset**: The authors build AerialAgent-Ego10k, the first large-scale real-world pre-training dataset filmed from the first-person perspective of high-altitude UAVs, together with the virtual alignment pre-training dataset CyberAgent-Ego500k, to improve the adaptability of UAV agents in real and virtual environments.
2. **Clearly Defining Five Aerospace-Embodied Downstream Tasks for the First Time**: The tasks are scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision. The corresponding instruction datasets, SkyAgent-Scene3k, SkyAgent-Reason3k, SkyAgent-Nav3k, SkyAgent-Plan3k, and SkyAgent-Act3k, support an end-to-end closed loop of perception, cognition, and action.
3. **Developing an Automated Evaluation Method**: The authors develop SkyAgent-Eval, a GPT-4-based method that evaluates downstream task results comprehensively, flexibly, and objectively. It returns quantitative scores with explanations, which makes the evaluation results more credible.
4. **Extensive Experimental Verification**: The authors run experiments with ten mainstream baseline models, analyze their performance on the downstream instruction datasets, and reveal the potential and limitations of 2D/3D vision-language models in UAV agent tasks.
Through these efforts, this paper aims to fill the gap in UAV agent research and promote the development of aerospace-embodied intelligence.
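The automated evaluation in contribution 3 is described only at a high level: a GPT-4 judge returns a quantitative score and an explanation. A minimal Python sketch of that general pattern is shown here. It is not the paper's implementation: the prompt wording, the 0-10 scale, the `Score:`/`Explanation:` reply format, and all function names are assumptions made for illustration, and the judge is passed in as a plain callable so that no particular model API is implied.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Hypothetical prompt wording; the paper's actual GPT-4 prompts are not given here.
PROMPT_TEMPLATE = (
    "You are grading a UAV agent's answer for the task '{task}'.\n"
    "Reference answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Reply with 'Score: <0-10>' on the first line, "
    "then 'Explanation: <reasons>'."
)


@dataclass
class Judgement:
    score: float      # clamped to the assumed 0-10 scale
    explanation: str  # free-text rationale returned by the judge


def build_prompt(task: str, reference: str, prediction: str) -> str:
    """Fill the (assumed) grading template for one sample."""
    return PROMPT_TEMPLATE.format(
        task=task, reference=reference, prediction=prediction
    )


def parse_judgement(reply: str) -> Judgement:
    """Extract the numeric score and the explanation from a judge reply."""
    score_match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if score_match is None:
        raise ValueError("judge reply contains no 'Score:' field")
    score = min(max(float(score_match.group(1)), 0.0), 10.0)
    expl_match = re.search(r"Explanation:\s*(.*)", reply, flags=re.DOTALL)
    explanation = expl_match.group(1).strip() if expl_match else ""
    return Judgement(score, explanation)


def evaluate(
    samples: Iterable[Tuple[str, str, str]],
    judge: Callable[[str], str],
) -> Tuple[float, List[Judgement]]:
    """Score (task, reference, prediction) triples and return the mean score.

    `judge` is any function mapping a prompt string to a reply string,
    for example a thin wrapper around a chat-completion client.
    """
    judgements = [parse_judgement(judge(build_prompt(*s))) for s in samples]
    if not judgements:
        raise ValueError("no samples to evaluate")
    mean_score = sum(j.score for j in judgements) / len(judgements)
    return mean_score, judgements
```

Keeping the judge behind a callable means the parsing and aggregation logic can be checked with a stub before any model is attached, and the explanation is stored next to each score so that individual grades can be audited.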