HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving

Zehuan Wu, Jingcheng Ni, Xiaodong Wang, Yuxin Guo, Rui Chen, Lewei Lu, Jifeng Dai, Yuwen Xiong
2024-12-02
Abstract: Generative models have significantly improved the generation and prediction quality on either camera images or LiDAR point clouds for autonomous driving. However, a real-world autonomous driving system uses multiple input modalities, usually cameras and LiDARs, which contain complementary information for generation. Existing generation methods ignore this crucial feature, so their results cover only separate 2D or 3D information. To fill the gap in 2D-3D multi-modal joint generation for autonomous driving, we propose a framework, HoloDrive, that jointly generates camera images and LiDAR point clouds. We employ BEV-to-Camera and Camera-to-BEV transform modules between heterogeneous generative models, and introduce a depth prediction branch in the 2D generative model to disambiguate the un-projection from image space to BEV space. We then extend the method to predict the future by adding temporal structure and carefully designed progressive training. Further, we conduct experiments on single-frame generation and world model benchmarks, and demonstrate that our method leads to significant performance gains over SOTA methods in terms of generation metrics.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to jointly generate multi-view camera images and LiDAR point clouds for autonomous driving so that the 2D and 3D modalities are consistent with each other. Existing generation methods mainly target a single modality, producing either 2D images or 3D point clouds, and ignore the complementary information the two modalities carry. Their outputs therefore cover only separate 2D or 3D information and do not reflect the multiple input modalities that real-world autonomous driving systems rely on. To fill this gap in 2D-3D multi-modal joint generation, the paper proposes a framework named HoloDrive. It connects the 2D and 3D generative models with BEV-to-Camera and Camera-to-BEV transform modules, and adds a depth prediction branch to the 2D generative model to disambiguate the un-projection from image space to BEV space. The paper also extends the method to future prediction by adding temporal structure and a carefully designed progressive training scheme. The authors evaluate HoloDrive on single-frame generation and world model benchmarks and report significant gains over state-of-the-art methods on generation metrics.
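To make the role of predicted depth in a Camera-to-BEV transform more concrete, the following is a minimal sketch of a generic depth-weighted lifting scheme. It is not the paper's module: the function name `lift_to_bev`, the pinhole camera model, the discrete depth bins, and the grid layout are all assumptions chosen for illustration. A single pixel is ambiguous along its camera ray, so each pixel's feature is spread over candidate depths in proportion to a predicted depth distribution and then accumulated into the BEV cell each candidate point falls in.

```python
import numpy as np


def lift_to_bev(feats, depth_probs, depth_bins, K, grid_size=8, cell=1.0):
    """Scatter image features into a BEV grid, weighted by per-pixel depth.

    Hypothetical helper for illustration only.
    feats:       (H, W, C) image features
    depth_probs: (H, W, D) per-pixel distribution over depth_bins
    depth_bins:  (D,) candidate depths along the camera z axis
    K:           (3, 3) pinhole intrinsics
    Returns a (grid_size, grid_size, C) map indexed [forward z, lateral x].
    """
    H, W, C = feats.shape
    # Homogeneous pixel coordinates (u, v, 1) for every pixel.
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project pixels to camera rays: K^-1 @ (u, v, 1).
    rays = pix @ np.linalg.inv(K).T

    bev = np.zeros((grid_size, grid_size, C))
    for d, depth in enumerate(depth_bins):
        pts = rays * depth  # candidate 3D points in the camera frame
        # Lateral x is centred on the grid; forward z starts at row 0.
        ix = np.floor(pts[..., 0] / cell + grid_size / 2).astype(int)
        iz = np.floor(pts[..., 2] / cell).astype(int)
        ok = (ix >= 0) & (ix < grid_size) & (iz >= 0) & (iz < grid_size)
        # Weight each feature by the probability of this depth bin.
        weighted = feats * depth_probs[..., d:d + 1]
        # np.add.at accumulates correctly when several pixels hit one cell.
        np.add.at(bev, (iz[ok], ix[ok]), weighted[ok])
    return bev
```

With a uniform depth distribution the feature is smeared along the whole ray, while a sharply peaked distribution places it in a single BEV cell. This is the sense in which a depth branch can disambiguate the un-projection. The height axis is simply collapsed here, and a real multi-camera setup would also need extrinsics to bring all views into a shared ego frame.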