Abstract: With the continuous maturation and application of autonomous driving technology, a systematic examination of open-source autonomous driving datasets becomes instrumental in fostering the robust evolution of the industry ecosystem. Current autonomous driving datasets can broadly be categorized into two generations. The first-generation autonomous driving datasets are characterized by relatively simple sensor modalities, smaller data scale, and a restriction to perception-level tasks. KITTI, introduced in 2012, serves as a prominent representative of this initial wave. In contrast, the second-generation datasets exhibit heightened complexity in sensor modalities, greater data scale and diversity, and an expansion of tasks from perception to encompass prediction and control. Leading examples of the second generation include nuScenes and Waymo, introduced around 2019. This comprehensive review, conducted in collaboration with esteemed colleagues from both academia and industry, systematically assesses over seventy open-source autonomous driving datasets from domestic and international sources. It offers insights into various aspects, such as the principles underlying the creation of high-quality datasets, the pivotal role of data engine systems, and the utilization of generative foundation models to facilitate scalable data generation. Furthermore, this review undertakes an exhaustive analysis and discussion of the characteristics and data scales that future third-generation autonomous driving datasets should possess. It also delves into the scientific and technical challenges that warrant resolution. These endeavors are pivotal in advancing autonomous innovation and fostering technological enhancement in critical domains. For further details, please refer to https://github.com/OpenDriveLab/DriveAGI.

What problem does this paper attempt to address?

The paper addresses the systematic evaluation of high-quality open-source datasets for autonomous driving and their future development directions. Specifically:

1. **Systematic evaluation of existing datasets**: The paper analyzes more than 70 open-source autonomous driving datasets drawn from a wide range of sources. The analysis covers the principles for creating high-quality datasets, the crucial role of data engine systems, and the use of generative foundation models for scalable data generation.
2. **Key elements of next-generation datasets**: The paper analyzes the key elements that upcoming third-generation datasets should possess and examines the scientific and technological challenges that remain. These elements include, but are not limited to: full coverage of sensor types, rich scene coverage, and high-quality raw data and annotations; a flexible formulation that supports both short-term and long-term use as well as new paradigms such as end-to-end frameworks and world models; and an intelligence-oriented design that supports interpretability and logical reasoning in language.
3. **Impact assessment of datasets**: The paper proposes an evaluation metric that estimates the impact of an autonomous driving dataset on the effectiveness of algorithm development. The metric evaluates the availability, accuracy, and applicability of datasets, filling a gap in the current literature, which lacks such evaluation criteria. Using this metric, the paper classifies existing public datasets into three levels (low, medium, and high), each corresponding to a different score range.

In conclusion, the paper reviews and evaluates existing autonomous driving datasets in detail and gives clear directions and suggestions for constructing future ones. It emphasizes the importance of data quality and scale, and discusses how advanced technologies such as generative foundation models can support dataset construction and development in the new era of artificial intelligence.

Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future

A Survey on Autonomous Driving Datasets: Statistics, Annotation Quality, and a Future Outlook

Scalability in Perception for Autonomous Driving: Waymo Open Dataset

Data-Centric Evolution in Autonomous Driving: A Comprehensive Survey of Big Data System, Data Mining, and Closed-Loop Technologies

End-to-end Autonomous Driving: Challenges and Frontiers

A Survey on Datasets for Decision-making of Autonomous Vehicle

The NEOLIX Open Dataset for Autonomous Driving

A Survey on Datasets for the Decision Making of Autonomous Vehicles

Synthetic Datasets for Autonomous Driving: A Survey

Is it Safe to Drive? An Overview of Factors, Metrics, and Datasets for Driveability Assessment in Autonomous Driving

Is it Safe to Drive? An Overview of Factors, Challenges, and Datasets for Driveability Assessment in Autonomous Driving

The OpenCDA Open-source Ecosystem for Cooperative Driving Automation Research

Towards Knowledge-driven Autonomous Driving

Preliminary Investigation into Data Scaling Laws for Imitation Learning-Based End-to-End Autonomous Driving

Development of Open Informal Dataset Affecting Autonomous Driving

The ApolloScape Open Dataset for Autonomous Driving and its Application

Collaborative Perception Datasets in Autonomous Driving: A Survey

OpenMPD: An Open Multimodal Perception Dataset for Autonomous Driving

OASim: an Open and Adaptive Simulator based on Neural Rendering for Autonomous Driving