SubjectDrive: Scaling Generative Data in Autonomous Driving via Subject Control

Binyuan Huang, Yuqing Wen, Yucheng Zhao, Yaosi Hu, Yingfei Liu, Fan Jia, Weixin Mao, Tiancai Wang, Chi Zhang, Chang Wen Chen, Zhenzhong Chen, Xiangyu Zhang
2024-03-28
Abstract: Autonomous driving progress relies on large-scale annotated datasets. In this work, we explore the potential of generative models to produce vast quantities of freely-labeled data for autonomous driving applications and present SubjectDrive, the first model proven to scale generative data production in a way that could continuously improve autonomous driving applications. We investigate the impact of scaling up the quantity of generative data on the performance of downstream perception models and find that enhancing data diversity plays a crucial role in effectively scaling generative data production. Therefore, we have developed a novel model equipped with a subject control mechanism, which allows the generative model to leverage diverse external data sources for producing varied and useful data. Extensive evaluations confirm SubjectDrive's efficacy in generating scalable autonomous driving training data, marking a significant step toward revolutionizing data production methods in this field.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
This paper asks how a generative model can produce large-scale, diverse labeled data for autonomous driving so that downstream perception tasks (such as 3D object detection and tracking) improve. Specifically, it focuses on how to scale up the quantity of generated data while preserving its quality and diversity, so that perception models trained on the generated data perform significantly better.

### Background and Problem Description of the Paper

1. **The Need for Large-Scale Labeled Data in Autonomous Driving**
   - Progress in autonomous driving depends on large-scale labeled datasets.
   - Acquiring and labeling real-world data is expensive and time-consuming, and it raises issues of data privacy and usage rights.
   - Using generative models to create large amounts of freely labeled data has therefore become an important research direction.

2. **Limitations of Existing Methods**
   - Although existing generative models can produce high-quality driving-scene videos, their effectiveness in scaling up the amount of generated data is limited.
   - For example, methods such as Panacea fail to significantly improve downstream perception performance when generating a large amount of data, mainly because the generated data lacks diversity.

3. **The Importance of Introducing the Subject Control Mechanism**
   - To overcome these limitations, the paper proposes a new generative framework, SubjectDrive, which enhances the diversity of generated data by introducing a subject control mechanism.
   - The subject control mechanism allows the generative model to draw on diverse elements from external data sources, and thereby to generate more varied and useful samples.

### Core Contributions of the Paper

- **Proposing the SubjectDrive Framework**: The framework significantly improves the diversity and quality of generated data by introducing a subject control mechanism.
- **Verifying the Effectiveness of Scaling Generated Data**: Experiments show that SubjectDrive not only improves downstream perception performance as the amount of generated data increases, but also outperforms models pre-trained on large-scale real datasets.
- **Innovative Technical Modules**: These include the Subject Prompt Adapter (SPA), the Subject Visual Adapter (SVA), and the Augmented Temporal Attention (ATA). The modules work together to keep the generated videos spatio-temporally consistent.

### Formula Representation

The main formulas in the paper are as follows:

1. **Generation Process of the Diffusion Model**

   \[ p_\theta(x_{t-1}\mid x_t)=\mathcal{N}(x_{t-1};\mu_\theta(x_t,t),\Sigma_\theta(x_t,t)) \]

   \[ x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon,\quad\epsilon\sim\mathcal{N}(0,I),\quad x_0\sim p(x) \]

   \[ \min_\theta\mathbb{E}_{t,x,\epsilon}\|\epsilon-\epsilon_\theta(x_t,t)\|^2 \]

2. **Enhanced Text Embedding of the Subject Prompt Adapter (SPA)**

   \[ \hat{z}_t^i=\text{MLP}([z_t^i+z_{\text{id}}^i,z_v^i]),\quad i\in\{1,2,\dots,M\} \]

3. **Position-Enhanced Subject Embedding of the Subject Visual Adapter (SVA)**

   \[ f_v^l=\text{MLP}([f_v,\text{Fourier}(l)]) \]

   \[ z=z+\tanh(\gamma)\cdot T_S(\text{SelfAttn}([z,f_v^l])) \]
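
The two SVA equations above can be illustrated with a small NumPy sketch. It is only an illustration under assumptions that do not come from the paper: `l` is taken to be a normalized 4-value box per subject, `Fourier` is a plain sin/cos encoding, the MLP has one hidden ReLU layer with random weights, `SelfAttn` is single-head attention without learned projections, and `T_S` is assumed to keep only the token positions that belong to `z`. All function and parameter names are hypothetical.

```python
import numpy as np


def fourier_embed(boxes, num_freqs=4):
    """Assumed Fourier(l): sin/cos features of box coordinates, (M, 4) -> (M, 8 * num_freqs)."""
    freqs = 2.0 ** np.arange(num_freqs)                  # (F,)
    angles = boxes[..., None] * freqs * np.pi            # (M, 4, F)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(boxes.shape[0], -1)


def mlp(x, w1, w2):
    """Two-layer MLP with ReLU (an assumed architecture)."""
    return np.maximum(x @ w1, 0.0) @ w2


def self_attention(tokens):
    """Single-head self-attention with identity projections (a simplification)."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens


def subject_visual_adapter(z, f_v, boxes, gamma, w1, w2):
    """Gated residual update z + tanh(gamma) * T_S(SelfAttn([z, f_v^l]))."""
    # f_v^l = MLP([f_v, Fourier(l)])
    f_vl = mlp(np.concatenate([f_v, fourier_embed(boxes)], axis=-1), w1, w2)
    # Attend over the concatenated visual and subject tokens.
    attended = self_attention(np.concatenate([z, f_vl], axis=0))
    # T_S (assumed): keep only the outputs at the positions of z.
    selected = attended[: z.shape[0]]
    return z + np.tanh(gamma) * selected


rng = np.random.default_rng(0)
N, M, d = 6, 2, 8                                        # visual tokens, subjects, width
z = rng.normal(size=(N, d))
f_v = rng.normal(size=(M, d))
boxes = rng.uniform(size=(M, 4))
w1 = rng.normal(size=(d + 32, 16)) * 0.1                 # 32 = 4 coords * 2 * num_freqs
w2 = rng.normal(size=(16, d)) * 0.1

out_closed = subject_visual_adapter(z, f_v, boxes, 0.0, w1, w2)
out_open = subject_visual_adapter(z, f_v, boxes, 1.0, w1, w2)
```

With `gamma = 0.0` the gate `tanh(gamma)` is zero, so `out_closed` equals `z`; a nonzero `gamma` mixes the subject tokens into the visual tokens. Whether the paper initializes the gate at zero is not stated in this summary.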