Abstract: Generating multi-view videos for autonomous driving training has recently gained much attention, with the challenge of addressing both cross-view and cross-frame consistency. Existing methods typically apply decoupled attention mechanisms for spatial, temporal, and view dimensions. However, these approaches often struggle to maintain consistency across dimensions, particularly when handling fast-moving objects that appear at different times and viewpoints. In this paper, we present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. We also propose a lightweight controller tailored for CogDriving, i.e., Micro-Controller, which uses only 1.1% of the parameters of the standard ControlNet, enabling precise control over Bird's-Eye-View layouts. To enhance the generation of object instances crucial for autonomous driving, we propose a re-weighted learning objective, dynamically adjusting the learning weights for object instances during training. CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos. The project can be found at https://luhannan.github.io/CogDrivingPage/.
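The abstract does not give the formula for the re-weighted learning objective, so the following is only a minimal sketch of the general idea: a per-position squared error in which positions covered by object instances receive a larger weight than background. The function names `object_weight` and `reweighted_mse`, the inverse-area weighting rule, and the `max_weight` cap are all illustrative assumptions, not the authors' method.

```python
def object_weight(object_mask, max_weight=5.0):
    """Assumed rule: the rarer the object positions, the larger their weight.

    Recomputed for every sample, which is one simple way to make the
    weighting dynamic. The paper's actual rule may differ.
    """
    n = len(object_mask)
    fg = sum(object_mask)
    if fg == 0:
        return 1.0  # no objects in this sample: fall back to uniform weights
    return max(1.0, min(max_weight, (n - fg) / fg))


def reweighted_mse(pred, target, object_mask, max_weight=5.0):
    """Weighted mean squared error that emphasizes object-instance positions."""
    w_obj = object_weight(object_mask, max_weight)
    weights = [w_obj if m else 1.0 for m in object_mask]
    total = sum(w * (p - t) ** 2 for w, p, t in zip(weights, pred, target))
    return total / sum(weights)


# Toy example: the only error sits on the single object position.
pred = [1.0, 0.0, 0.0, 0.0]
target = [0.0, 0.0, 0.0, 0.0]
mask = [1, 0, 0, 0]

plain = reweighted_mse(pred, target, [0, 0, 0, 0])  # uniform weights
weighted = reweighted_mse(pred, target, mask)       # object position up-weighted
```

In this toy case the error on the object position contributes more to the weighted loss than to the uniform one, which is the balancing effect between object instances and background that the abstract describes.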

What problem does this paper attempt to address?

This paper addresses the generation of high-quality multi-view driving-scene videos for autonomous driving training. Specifically, it focuses on maintaining cross-view and cross-frame consistency at the same time. Existing methods usually apply separate attention mechanisms to the spatial, temporal, and view dimensions, but they struggle to stay consistent when handling fast-moving objects, especially objects that appear at different times and in different views. The paper therefore proposes a new network architecture, CogDriving, which synthesizes high-quality multi-view driving videos through holistic-4D attention modules to overcome these limitations.

The main contributions of CogDriving include:

- Proposing a Diffusion Transformer equipped with holistic-4D attention modules, which simultaneously model the associations among the spatial, temporal, and view dimensions.
- Introducing a lightweight control branch, Micro-Controller, designed specifically for the 4D attention in CogDriving. It has only 1.1% of the parameters of the standard ControlNet yet still achieves competitive control over the generated results.
- Designing a re-weighted learning objective that emphasizes the supervision of object instances, balancing the generation of object instances and background content.
- Achieving an FVD score of 37.8 on the nuScenes validation set, indicating the ability to generate high-quality driving-scene videos. Experiments also show that the synthesized videos can significantly improve the performance of state-of-the-art BEV perception models, supporting the method's practical value.
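The difference between decoupled and holistic attention described above can be illustrated with a toy sketch. It flattens a small grid of (view, frame, location) tokens into one sequence and compares which token pairs can interact directly in a single attention step. The grid sizes, the mask functions, and the projection-free `self_attention` helper are hypothetical simplifications for illustration; they are not the CogDriving implementation.

```python
import math
from itertools import product

V, T, S = 2, 2, 3  # toy sizes: views, frames, spatial tokens per frame (H*W flattened)
coords = list(product(range(V), range(T), range(S)))  # one flat sequence of V*T*S tokens


# Each mask answers: may query token i attend to key token j?
def holistic(i, j):
    return True  # every token sees every view, frame, and location


def spatial_only(i, j):
    return coords[i][:2] == coords[j][:2]  # same view and same frame


def temporal_only(i, j):
    return (coords[i][0], coords[i][2]) == (coords[j][0], coords[j][2])  # same view, same location


def view_only(i, j):
    return coords[i][1:] == coords[j][1:]  # same frame and same location


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, mask):
    """Single-head scaled dot-product self-attention without learned projections."""
    dim = len(tokens[0])
    outputs = []
    for i, query in enumerate(tokens):
        keys = [j for j in range(len(tokens)) if mask(i, j)]
        scores = [sum(a * b for a, b in zip(query, tokens[j])) / math.sqrt(dim) for j in keys]
        weights = softmax(scores)
        outputs.append([sum(w * tokens[j][d] for w, j in zip(weights, keys)) for d in range(dim)])
    return outputs


# Toy features: each token simply carries its own (view, frame, location) coordinates.
tokens = [[float(c) for c in coord] for coord in coords]

src = coords.index((0, 0, 0))  # an object seen in view 0, frame 0
dst = coords.index((1, 1, 2))  # the same object later, in another view and location

direct_links = {m.__name__: m(src, dst) for m in (holistic, spatial_only, temporal_only, view_only)}
mixed = self_attention(tokens, holistic)
```

Here `direct_links` shows that only the holistic mask connects the two tokens in one step. With decoupled modules, information about an object that changes view, frame, and location at once has to pass through several separate attention stages, which matches the consistency problem with fast-moving objects noted above. The trade-off is cost: the holistic sequence has V*T*S tokens, so the number of attended pairs grows quadratically in that product.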

Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention

DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation

Beyond Traditional Driving Scenes: A Robotic-Centric Paradigm for 2D+3D Human Tracking Using Siamese Transformer Network

UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving

DiVE: DiT-based Video Generation with Enhanced Control

DrivingDiffusion: Layout-Guided multi-view driving scene video generation with latent diffusion model

A Multi-view 3D Vehicle Detection Method Based On Novel 3D Proposal Generation Method

X-Drive: Cross-modality consistent multi-sensor data synthesis for driving scenarios

MyGo: Consistent and Controllable Multi-View Driving Video Generation with Camera Control

DeepGoal: Learning to drive with driving intention from human control demonstration

Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model

Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models

DreamForge: Motion-Aware Autoregressive Video Generation for Multi-View Driving Scenes

HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving

Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention

Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation

Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference

Multi-View Fusion of Sensor Data for Improved Perception and Prediction in Autonomous Driving

BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving

A General Framework of Learning Multi-Vehicle Interaction Patterns from Video