Guosheng Zhao, Chaojun Ni, Xiaofeng Wang, Zheng Zhu, Xueyang Zhang, Yida Wang, Guan Huang, Xinze Chen, Boyuan Wang, Youyi Zhang, Wenjun Mei, Xingang Wang
Abstract: Closed-loop simulation is essential for advancing end-to-end autonomous driving systems. Contemporary sensor simulation methods, such as NeRF and 3DGS, rely predominantly on conditions closely aligned with training data distributions, which are largely confined to forward-driving scenarios. Consequently, these methods face limitations when rendering complex maneuvers (e.g., lane change, acceleration, deceleration). Recent advancements in autonomous-driving world models have demonstrated the potential to generate diverse driving videos. However, these approaches remain constrained to 2D video generation, inherently lacking the spatiotemporal coherence required to capture intricacies of dynamic driving environments. In this paper, we introduce DriveDreamer4D, which enhances 4D driving scene representation leveraging world model priors. Specifically, we utilize the world model as a data machine to synthesize novel trajectory videos based on real-world driving data. Notably, we explicitly leverage structured conditions to control the spatial-temporal consistency of foreground and background elements, thus the generated data adheres closely to traffic constraints. To our knowledge, DriveDreamer4D is the first to utilize video generation models for improving 4D reconstruction in driving scenarios. Experimental results reveal that DriveDreamer4D significantly enhances generation quality under novel trajectory views, achieving a relative improvement in FID by 24.5%, 39.0%, and 10.5% compared to PVG, S3Gaussian, and Deformable-GS. Moreover, DriveDreamer4D markedly enhances the spatiotemporal coherence of driving agents, which is verified by a comprehensive user study and the relative increases of 20.3%, 42.0%, and 13.7% in the NTA-IoU metric.
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve
This paper addresses the shortcomings of current autonomous driving scene representation methods when rendering complex maneuvers such as lane changes, acceleration, and deceleration. Existing sensor simulation methods (such as NeRF and 3DGS) rely mainly on conditions close to the training data distribution, which is largely confined to forward-driving scenarios, so their rendering quality degrades under these maneuvers. Although recent autonomous driving world models can generate diverse driving videos, they are still limited to 2D video generation and lack the spatiotemporal consistency required for dynamic driving environments.
To solve these problems, the paper proposes **DriveDreamer4D**, which enhances 4D driving scene representation by integrating prior knowledge from autonomous driving world models. Specifically, DriveDreamer4D uses the world model as a data generator to synthesize novel trajectory videos based on real-world driving data. Additionally, the paper introduces a **Novel Trajectory Generation Module (NTGM)** that generates diverse structured traffic conditions and independently adjusts the motion dynamics of foreground and background elements in complex driving environments, so that the synthesized data adheres closely to the spatiotemporal constraints of 4D driving scenes.
### Main Contributions
1. **Proposing DriveDreamer4D**: To the authors' knowledge, this is the first framework to use world model priors (video generation models) to improve the quality of 4D reconstruction in driving scenarios.
2. **Novel Trajectory Generation Module (NTGM)**: Automatically generates diverse structured conditions, enabling DriveDreamer4D to produce novel trajectory videos that include complex maneuvers. By explicitly incorporating structured conditions, it keeps foreground and background elements spatiotemporally consistent.
3. **Experimental Validation**: Extensive experiments show that DriveDreamer4D improves both the rendering quality of novel trajectory views and the spatiotemporal consistency of driving scene elements. Relative to the baselines PVG, S3Gaussian, and Deformable-GS, it improves FID by 24.5%, 39.0%, and 10.5%, respectively, and NTA-IoU by 20.3%, 42.0%, and 13.7%, respectively. In the user study, DriveDreamer4D's average win rate exceeds 80%.
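The relative improvements quoted above can be read with the usual relative-change arithmetic. The sketch below assumes that conventional definition, since the summary does not spell out the formula. `relative_improvement` is a hypothetical helper, and the scores are made-up placeholders, not numbers from the paper. FID is treated as lower-is-better and IoU as higher-is-better.

```python
def relative_improvement(baseline: float, ours: float, lower_is_better: bool) -> float:
    """Return the improvement of `ours` over `baseline` as a percentage of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    # For lower-is-better metrics (e.g. FID) a drop is a gain; otherwise a rise is a gain.
    delta = (baseline - ours) if lower_is_better else (ours - baseline)
    return 100.0 * delta / abs(baseline)


# Placeholder values for illustration only (not taken from the paper).
fid_gain = relative_improvement(baseline=100.0, ours=80.0, lower_is_better=True)
iou_gain = relative_improvement(baseline=0.50, ours=0.60, lower_is_better=False)
```

Under this definition, a reported "24.5% relative improvement in FID" would mean the FID dropped by 24.5% of the baseline's value, not by 24.5 points.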
### Related Work
1. **Driving Scene Representation**: NeRF and 3DGS are the leading 3D scene representation methods, but their quality in dynamic driving environments is limited by the density of the input data.
2. **World Models**: World models generate videos by predicting future states, but existing models mainly generate 2D videos and lack the spatiotemporal consistency required for 4D driving scenes.
3. **Diffusion Priors for 3D Representation**: These methods extend training perspectives through generative models, but they mainly handle sparse image data or static background elements and cannot fully capture the complexity of 4D driving environments.
### Methodology
1. **4D Driving Scene Representation**: The 4DGS model represents driving scenes by integrating 3DGS and temporal field modules.
2. **Controllable Driving Video Generation World Model**: The world model module predicts future states based on imagined action sequences, guiding future video predictions through structured information or action control.
3. **DriveDreamer4D**:
   - **Novel Trajectory Generation Module (NTGM)**: Adjusts the original trajectory actions (such as steering angle and speed) to generate novel trajectories, which provide new viewpoints for extracting structured information (such as 3D bounding boxes and high-definition map details).
   - **4D Reconstruction and Video Diffusion Prior**: Generates diverse trajectory videos using the video diffusion prior, then optimizes the 4DGS model on the combination of original and novel trajectory videos.
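The trajectory-adjustment idea behind NTGM can be illustrated geometrically. The sketch below is a simplification under stated assumptions, not the paper's implementation: it assumes the ego trajectory is available as 2D positions plus headings, and the helpers `lane_change_trajectory` and `speed_change_trajectory` are hypothetical names introduced here. A lane change is modeled as a smooth lateral offset, and acceleration or deceleration as a rescaling of per-step displacements.

```python
import numpy as np


def lane_change_trajectory(xy: np.ndarray, yaw: np.ndarray, lateral_shift: float) -> np.ndarray:
    """Shift an ego trajectory sideways with a smooth ramp.

    xy: (T, 2) ego positions; yaw: (T,) headings in radians.
    lateral_shift: final offset in meters, positive to the left of the heading.
    """
    t = np.linspace(0.0, 1.0, len(xy))
    ramp = 3 * t**2 - 2 * t**3  # smoothstep: 0 at the start, 1 at the end
    # Unit vector pointing to the left of the heading at each step.
    left = np.stack([-np.sin(yaw), np.cos(yaw)], axis=1)
    return xy + (lateral_shift * ramp)[:, None] * left


def speed_change_trajectory(xy: np.ndarray, scale: float) -> np.ndarray:
    """Rescale per-step displacements: scale > 1 speeds up, scale < 1 slows down."""
    steps = np.diff(xy, axis=0) * scale
    return np.vstack([xy[:1], xy[:1] + np.cumsum(steps, axis=0)])


# Straight drive along +x, one pose per step (illustrative data).
xy = np.stack([np.arange(10.0), np.zeros(10)], axis=1)
yaw = np.zeros(10)
shifted = lane_change_trajectory(xy, yaw, lateral_shift=3.5)
slowed = speed_change_trajectory(xy, scale=0.5)
```

In the pipeline described above, poses like `shifted` or `slowed` would then serve as the new viewpoints from which structured conditions (3D boxes, map elements) are extracted to condition the world model. That step is not shown here.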
### Experiments
1. **Experimental Setup**: Experiments are conducted using the Waymo dataset, selecting eight highly dynamic interactive scenes.
2. **Quantitative Results**: DriveDreamer4D outperforms the baseline methods on the NTA-IoU and NTL-IoU metrics, especially under complex maneuvers such as lane changes, acceleration, and deceleration.
3. **Qualitative Results**: User studies further validate the rendering quality of new trajectory viewpoints by DriveDreamer4D, with users showing a clear preference for DriveDreamer4D over baseline methods.
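The summary does not define NTA-IoU or NTL-IoU. As background only, the sketch below shows the standard intersection-over-union between two axis-aligned 2D boxes, plus a simple averaging scheme. It assumes that an agent-level IoU metric compares where an agent is expected to appear in a rendered frame against where a detector finds it. The matching rule and the helper names `box_iou` and `mean_matched_iou` are illustrative assumptions, not the paper's protocol.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_matched_iou(expected: Sequence[Box], detected: Sequence[Box]) -> float:
    """Average, over expected boxes, of the best IoU with any detected box.

    An expected box with no detection contributes 0 (an assumed convention).
    """
    if not expected:
        return 0.0
    best = [max((box_iou(e, d) for d in detected), default=0.0) for e in expected]
    return sum(best) / len(best)
```

A score near 1 would indicate that rendered agents sit where the structured conditions say they should. A score near 0 would indicate missing or misplaced agents.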
### Conclusion
DriveDreamer4D through