Forging Vision Foundation Models for Autonomous Driving: Challenges, Methodologies, and Opportunities

Xu Yan, Haiming Zhang, Yingjie Cai, Jingming Guo, Weichao Qiu, Bin Gao, Kaiqiang Zhou, Yue Zhao, Huan Jin, Jiantao Gao, Zhen Li, Lihui Jiang, Wei Zhang, Hongbo Zhang, Dengxin Dai, Bingbing Liu
2024-01-16
Abstract: The rise of large foundation models, trained on extensive datasets, is revolutionizing the field of AI. Models such as SAM, DALL-E2, and GPT-4 showcase their adaptability by extracting intricate patterns and performing effectively across diverse tasks, thereby serving as potent building blocks for a wide range of AI applications. Autonomous driving, a vibrant front in AI applications, remains challenged by the lack of dedicated vision foundation models (VFMs). The scarcity of comprehensive training data, the need for multi-sensor integration, and the diverse task-specific architectures pose significant obstacles to the development of VFMs in this field. This paper delves into the critical challenge of forging VFMs tailored specifically for autonomous driving, while also outlining future directions. Through a systematic analysis of over 250 papers, we dissect essential techniques for VFM development, including data preparation, pre-training strategies, and downstream task adaptation. Moreover, we explore key advancements such as NeRF, diffusion models, 3D Gaussian Splatting, and world models, presenting a comprehensive roadmap for future research. To empower researchers, we have built and maintained
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the development and application of vision foundation models (VFMs) for autonomous driving. It identifies three main obstacles:

1. **Data Scarcity**: High-quality, multi-sensor datasets for autonomous driving are limited, and collecting them is constrained by privacy protection and safety regulations.
2. **Task Heterogeneity**: Autonomous driving spans a wide variety of tasks, including object detection, semantic segmentation, and depth estimation. Each task has its own input and output formats, which makes it difficult to build a single architecture that handles all of them efficiently.
3. **Model Adaptability**: Existing large-scale foundation models perform well on 2D images and text, but autonomous driving also requires exploiting rich 3D information and fusing multiple modalities.

To address these issues, the paper systematically analyzes over 250 related papers and proposes a unified framework that encompasses data preparation, self-supervised training, and downstream task adaptation. It also reviews how Generative Adversarial Networks (GANs), diffusion models, and Neural Radiance Fields (NeRF) can help mitigate data scarcity in autonomous driving, and discusses how foundation models from other fields can be applied to the driving domain. Together, these analyses are intended to provide a comprehensive technical roadmap for future research and development of vision foundation models in autonomous driving.