Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future

Hongyang Li, Yang Li, Huijie Wang, Jia Zeng, Huilin Xu, Pinlong Cai, Li Chen, Junchi Yan, Feng Xu, Lu Xiong, Jingdong Wang, Futang Zhu, Chunjing Xu, Tiancai Wang, Fei Xia, Beipeng Mu, Zhihui Peng, Dahua Lin, Yu Qiao
2024-03-22
Abstract: With the continuous maturation and application of autonomous driving technology, a systematic examination of open-source autonomous driving datasets becomes instrumental in fostering the robust evolution of the industry ecosystem. Current autonomous driving datasets can broadly be categorized into two generations. The first-generation autonomous driving datasets are characterized by relatively simpler sensor modalities, smaller data scale, and a restriction to perception-level tasks. KITTI, introduced in 2012, serves as a prominent representative of this initial wave. In contrast, the second-generation datasets exhibit heightened complexity in sensor modalities, greater data scale and diversity, and an expansion of tasks from perception to encompass prediction and control. Leading examples of the second generation include nuScenes and Waymo, introduced around 2019. This comprehensive review, conducted in collaboration with esteemed colleagues from both academia and industry, systematically assesses over seventy open-source autonomous driving datasets from domestic and international sources. It offers insights into various aspects, such as the principles underlying the creation of high-quality datasets, the pivotal role of data engine systems, and the utilization of generative foundation models to facilitate scalable data generation. Furthermore, this review undertakes an exhaustive analysis and discourse regarding the characteristics and data scales that future third-generation autonomous driving datasets should possess. It also delves into the scientific and technical challenges that warrant resolution. These endeavors are pivotal in advancing autonomous innovation and fostering technological enhancement in critical domains. For further details, please refer to https://github.com/OpenDriveLab/DriveAGI.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two related problems: the systematic evaluation of high-quality open-source datasets in autonomous driving, and the directions in which such datasets should develop. Specifically:
1. **Systematic evaluation of existing datasets**: The paper conducts a comprehensive analysis of more than 70 existing open-source autonomous driving datasets drawn from a wide range of sources. The analysis covers multiple aspects, such as the principles for creating high-quality datasets, the crucial role of data engine systems, and the use of generative foundation models to enable large-scale data generation.
2. **Key elements of the next-generation datasets**: The paper further analyzes the key elements that the upcoming third-generation datasets should possess, and examines the scientific and technological challenges that need to be addressed. These elements include, but are not limited to: full coverage of sensor types, rich scene coverage, and high-quality raw data and annotations; a flexible formulation that supports both short-term and long-term use as well as new paradigms such as end-to-end frameworks and world models; and an intelligence-oriented design that supports interpretability and logical reasoning in language.
3. **Impact assessment of datasets**: The paper proposes an evaluation metric to estimate the impact of autonomous driving datasets on the effectiveness of algorithm development. The metric evaluates the availability, accuracy, and applicability of datasets, filling a gap in the current literature, which lacks such evaluation criteria. Using this metric, the paper classifies the existing public datasets into three levels (low, medium, and high), each corresponding to a different score range.
In conclusion, this paper not only reviews and evaluates the existing autonomous driving datasets in detail, but also provides clear directions and suggestions for constructing future ones. It emphasizes the importance of data quality and scale, and the effective use of advanced technologies such as generative foundation models to build and develop datasets in the new era of artificial intelligence.