HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving

Zehuan Wu, Jingcheng Ni, Xiaodong Wang, Yuxin Guo, Rui Chen, Lewei Lu, Jifeng Dai, Yuwen Xiong

2024-12-02

Abstract: Generative models have significantly improved the quality of generation and prediction for either camera images or LiDAR point clouds in autonomous driving. However, a real-world autonomous driving system uses multiple input modalities, usually cameras and LiDARs, which contain complementary information for generation. Existing generation methods ignore this crucial feature, so their results cover only separate 2D or 3D information. To fill the gap in 2D-3D multi-modal joint generation for autonomous driving, we propose a framework, HoloDrive, that jointly generates camera images and LiDAR point clouds. We employ BEV-to-Camera and Camera-to-BEV transform modules between heterogeneous generative models and introduce a depth prediction branch in the 2D generative model to disambiguate the un-projection from image space to BEV space. We then extend the method to predict the future by adding temporal structure and carefully designed progressive training. Further, we conduct experiments on single-frame generation and world model benchmarks and demonstrate that our method leads to significant performance gains over SOTA methods in terms of generation metrics.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses how to jointly generate multi-view camera images and LiDAR point clouds for autonomous driving so that the 2D and 3D modalities are consistent with each other. Existing generation methods mainly target a single modality, producing only 2D images or only 3D point clouds, and ignore the complementary information the two modalities carry. Their outputs therefore cover separate 2D or 3D information and do not match the multi-sensor input that real-world autonomous driving systems rely on. To fill this gap in 2D-3D multi-modal joint generation, the paper proposes the HoloDrive framework. It connects heterogeneous generative models through BEV-to-Camera and Camera-to-BEV transform modules and adds a depth prediction branch to the 2D generative model, which resolves the ambiguity of un-projecting from image space to BEV space. The paper also extends the method to predict future scenes by adding temporal structure and a carefully designed progressive training scheme. According to the abstract, experiments on single-frame generation and world model benchmarks show significant gains over SOTA methods in generation metrics.
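The role of depth in the Camera-to-BEV direction can be illustrated with a small geometric sketch. A pixel alone fixes only a viewing ray, while a pixel plus a depth value fixes a single 3D point and therefore a single BEV cell. The code below is not HoloDrive's transform module; it is a minimal NumPy illustration that assumes a pinhole camera, and the function name `unproject_to_bev`, its parameters, and the grid layout are all hypothetical choices made for this example.

```python
import numpy as np

def unproject_to_bev(depth, K, cam_to_ego, bev_range=50.0, bev_size=100):
    """Lift every pixel of a depth map to 3D and count points per BEV cell.

    Illustrative assumptions (not taken from the paper):
      depth      -- (H, W) metric depth along the camera z-axis
      K          -- (3, 3) pinhole intrinsics
      cam_to_ego -- (4, 4) rigid transform from camera to ego frame
    The grid covers [-bev_range, bev_range) metres along ego x and y.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates, one row per pixel.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    # Back-project to rays with z = 1 in the camera frame.
    rays = pix @ np.linalg.inv(K).T
    # Scaling each ray by its depth selects one 3D point on that ray.
    pts_cam = rays * depth.reshape(-1, 1)
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    pts_ego = (pts_h @ cam_to_ego.T)[:, :3]
    # Discretize ego x, y into grid indices and drop the height.
    cell = 2.0 * bev_range / bev_size
    ix = np.floor((pts_ego[:, 0] + bev_range) / cell).astype(int)
    iy = np.floor((pts_ego[:, 1] + bev_range) / cell).astype(int)
    valid = (
        (ix >= 0) & (ix < bev_size) & (iy >= 0) & (iy < bev_size)
        & (depth.reshape(-1) > 0)
    )
    bev = np.zeros((bev_size, bev_size), dtype=np.int64)
    # np.add.at accumulates correctly when several pixels hit the same cell.
    np.add.at(bev, (ix[valid], iy[valid]), 1)
    return bev

# Toy usage: a 4x4 image whose pixels are all 10 m away.
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
cam_to_ego = np.eye(4)
# Camera z (forward) -> ego x, camera x (right) -> ego -y, camera y (down) -> ego -z.
cam_to_ego[:3, :3] = np.array([[0.0, 0.0, 1.0], [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]])
bev = unproject_to_bev(np.full((4, 4), 10.0), K, cam_to_ego)
```

In this toy setup all 16 pixels land in the grid row that corresponds to 10 m ahead of the ego vehicle. Changing the depth values moves them along their rays into different cells, which is the ambiguity a predicted depth is meant to resolve.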

MagicDrive3D: Controllable 3D Generation for Any-View Rendering in Street Scenes

Street-View Image Generation from a Bird's-Eye View Layout

Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention

DriveScape: Towards High-Resolution Controllable Multi-View Driving Video Generation

BEVControl: Accurately Controlling Street-view Elements with Multi-perspective Consistency via BEV Sketch Layout

DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation

BEVGPT: Generative Pre-trained Large Model for Autonomous Driving Prediction, Decision-Making, and Planning

WoVoGen: World Volume-aware Diffusion for Controllable Multi-camera Driving Scene Generation

DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation

MagicDrive: Street View Generation with Diverse 3D Geometry Control

BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving

X-Drive: Cross-modality consistent multi-sensor data synthesis for driving scenarios

BEVWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space

UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving

A Unified Generative Framework for Realistic Lidar Simulation in Autonomous Driving Systems

Monocular BEV Perception of Road Scenes Via Front-to-Top View Projection

Stag-1: Towards Realistic 4D Driving Simulation with Video Generation Model

Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation

Bi-Mapper: Holistic BEV Semantic Mapping for Autonomous Driving