Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention

Hannan Lu, Xiaohe Wu, Shudong Wang, Xiameng Qin, Xinyu Zhang, Junyu Han, Wangmeng Zuo, Ji Tao
2024-12-05
Abstract: Generating multi-view videos for autonomous driving training has recently gained much attention, with the challenge of addressing both cross-view and cross-frame consistency. Existing methods typically apply decoupled attention mechanisms for spatial, temporal, and view dimensions. However, these approaches often struggle to maintain consistency across dimensions, particularly when handling fast-moving objects that appear at different times and viewpoints. In this paper, we present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. We also propose a lightweight controller tailored for CogDriving, i.e., Micro-Controller, which uses only 1.1% of the parameters of the standard ControlNet, enabling precise control over Bird's-Eye-View layouts. To enhance the generation of object instances crucial for autonomous driving, we propose a re-weighted learning objective, dynamically adjusting the learning weights for object instances during training. CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos. The project can be found at https://luhannan.github.io/CogDrivingPage/.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the generation of high-quality multi-view driving-scene videos for autonomous driving training, focusing on maintaining cross-view and cross-frame consistency at the same time. Existing methods usually apply separate attention mechanisms to the spatial, temporal, and view dimensions, and they have difficulty maintaining consistency for fast-moving objects that appear at different times and in different views. The paper therefore proposes CogDriving, a network that synthesizes multi-view driving videos using holistic-4D attention modules.

The main contributions of CogDriving are:
- A Diffusion Transformer equipped with holistic-4D attention modules, which model the associations among the spatial, temporal, and view dimensions simultaneously.
- Micro-Controller, a lightweight control branch designed specifically for the 4D attention in CogDriving. It has only 1.1% of the parameters of the standard ControlNet but still achieves competitive control over the generated results.
- A re-weighted learning objective that emphasizes the supervision of object instances, balancing the generation of object instances against background content.
- An FVD score of 37.8 on the nuScenes validation set, indicating that CogDriving can generate high-quality driving-scene videos. Experiments also show that the synthesized videos can significantly improve the performance of state-of-the-art BEV perception models, supporting the method's practical value.
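To make the contrast between decoupled and holistic attention concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a token tensor of shape (views, frames, spatial tokens, channels) and uses single-head attention with identity projections; the function names and shapes are hypothetical. The holistic variant flattens all three axes into one token sequence, so a token can attend directly to a token in another view and another frame. The axis-wise variants each attend along one axis only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head self-attention over the second-to-last axis.
    # x: (..., N, D). Identity Q/K/V projections (an assumption for brevity).
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def holistic_4d_attention(x):
    # x: (V, T, S, D). All V*T*S tokens attend to each other in one pass.
    V, T, S, D = x.shape
    tokens = x.reshape(V * T * S, D)
    return self_attention(tokens).reshape(V, T, S, D)

def spatial_only_attention(x):
    # Tokens attend only within the same view and frame.
    return self_attention(x)

def temporal_only_attention(x):
    # Tokens attend only across frames, at a fixed view and spatial index.
    y = np.swapaxes(x, 1, 2)                      # (V, S, T, D)
    return np.swapaxes(self_attention(y), 1, 2)   # back to (V, T, S, D)

def view_only_attention(x):
    # Tokens attend only across views, at a fixed frame and spatial index.
    y = np.transpose(x, (1, 2, 0, 3))             # (T, S, V, D)
    return np.transpose(self_attention(y), (2, 0, 1, 3))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(2, 3, 4, 8))   # 2 views, 3 frames, 4 spatial tokens
    x_perturbed = x.copy()
    x_perturbed[1, 2, 3] += 1.0         # change a token in another view and frame

    for name, fn in [("holistic", holistic_4d_attention),
                     ("spatial", spatial_only_attention),
                     ("temporal", temporal_only_attention),
                     ("view", view_only_attention)]:
        changed = not np.allclose(fn(x)[0, 0, 0], fn(x_perturbed)[0, 0, 0])
        print(f"{name}: output at (0, 0, 0) changed = {changed}")
```

In this sketch, perturbing a token that differs in view, frame, and spatial position affects the output at (0, 0, 0) only under the holistic variant. Stacking the three axis-wise blocks would still propagate the information, but only indirectly through intermediate tokens. The holistic variant pays for the direct link with an attention matrix over V·T·S tokens rather than over one axis at a time.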
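The re-weighted learning objective can be illustrated in a similarly hedged way. The summary does not give the loss formula or the rule used to adjust the weights during training, so the sketch below assumes a per-element squared error with a boolean instance mask and a single foreground weight passed in by the caller. The names `reweighted_mse`, `instance_mask`, and `fg_weight` are hypothetical.

```python
import numpy as np

def reweighted_mse(pred, target, instance_mask, fg_weight):
    # Squared error where elements inside object instances are scaled by
    # fg_weight and background elements keep weight 1.0.
    # Normalising by the total weight keeps the loss scale comparable
    # across different values of fg_weight (a design choice of this sketch).
    weights = np.where(instance_mask, fg_weight, 1.0)
    sq_err = (pred - target) ** 2
    return float((weights * sq_err).sum() / weights.sum())

if __name__ == "__main__":
    pred = np.zeros((4, 4))
    target = np.zeros((4, 4))
    target[0, 0] = 1.0                      # the only error sits on an object
    mask = np.zeros((4, 4), dtype=bool)
    mask[0, 0] = True

    print(reweighted_mse(pred, target, mask, fg_weight=1.0))  # plain MSE
    print(reweighted_mse(pred, target, mask, fg_weight=4.0))  # object error emphasised
```

With `fg_weight=1.0` this reduces to the ordinary mean squared error. Larger values make errors on object instances contribute more to the loss than errors on the background, which is the balancing effect the contribution describes. How the weight would be adjusted over training is left open here.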