4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, Hsin-Ying Lee
2024-06-12
Abstract: Existing dynamic scene generation methods mostly rely on distilling knowledge from pre-trained 3D generative models, which are typically fine-tuned on synthetic object datasets. As a result, the generated scenes are often object-centric and lack photorealism. To address these limitations, we introduce a novel pipeline designed for photorealistic text-to-4D scene generation, discarding the dependency on multi-view generative models and instead fully utilizing video generative models trained on diverse real-world datasets. Our method begins by generating a reference video using the video generation model. We then learn the canonical 3D representation of the video using a freeze-time video, delicately generated from the reference video. To handle inconsistencies in the freeze-time video, we jointly learn a per-frame deformation to model these imperfections. We then learn the temporal deformation based on the canonical representation to capture dynamic interactions in the reference video. The pipeline facilitates the generation of dynamic scenes with enhanced photorealism and structural integrity, viewable from multiple perspectives, thereby setting a new standard in 4D scene generation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address several key issues in existing dynamic scene generation methods:

1. **Object-Centricity**: Because 4D data is scarce, existing 4D generation pipelines often rely on image, multi-view, and video generation models as priors to synthesize 4D samples. However, the multi-view models are fine-tuned on static, synthetic 3D assets, so the generated 4D results are primarily object-centric, lack realism, and capture little of the complex interaction between objects and their environment.
2. **Lack of Realism**: The scenes generated by existing methods often lack realism, especially when they involve dynamic objects and complex interactions.
3. **Dependence on Specific Datasets**: Existing methods typically rely on specific datasets for fine-tuning, which limits their diversity and generalization.

To address these issues, the paper proposes a new pipeline, 4Real, for generating realistic text-to-4D scenes. 4Real addresses the problems above in the following ways:

- **No Dependence on Multi-View Generation Models**: 4Real discards the reliance on multi-view generation models and instead leverages video generation models trained on large-scale real-world videos, which cover more diverse and general appearances, shapes, motions, and interactions between objects and the environment.
- **Improved Generation Quality and Diversity**: 4Real supports more use cases, generates more diverse results, and requires fewer computational resources.
- **Using Deformable 3D Gaussian Splatting (D-3DGS) to Represent Dynamic Scenes**: 4Real employs D-3DGS as the dynamic scene representation. It generates a reference video and a freeze-time video, reconstructs a canonical 3D representation from the freeze-time video, and then learns temporal deformations from the reference video, yielding dynamic scenes with realism and structural integrity.

Through these improvements, 4Real is capable of generating dynamic scenes with near-realistic quality from different viewpoints and time points, setting a new standard in the field of 4D scene generation.
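
The idea behind the D-3DGS representation, a canonical set of 3D Gaussians plus a time-dependent deformation, can be illustrated with a minimal sketch. This is not the paper's implementation: the `Gaussian` fields, the `deform` helper, and `toy_temporal_field` are hypothetical names chosen for illustration, only the Gaussian centers are deformed, and the hand-written field stands in for the deformation that 4Real learns from the reference video.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]

# Hypothetical type: a deformation field maps (canonical center, time) to an offset.
DeformationField = Callable[[Vec3, float], Vec3]


@dataclass(frozen=True)
class Gaussian:
    """A simplified 3D Gaussian primitive (assumed fields, for illustration)."""
    center: Vec3    # position in the canonical (freeze-time) space
    scale: Vec3     # per-axis extent
    opacity: float


def add(a: Vec3, b: Vec3) -> Vec3:
    """Component-wise sum of two 3-vectors."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def deform(canonical: List[Gaussian], field: DeformationField, t: float) -> List[Gaussian]:
    """Return the scene at time t by offsetting each canonical center.

    The canonical list is left untouched, so one canonical scene can be
    queried at any number of timestamps.
    """
    return [
        Gaussian(add(g.center, field(g.center, t)), g.scale, g.opacity)
        for g in canonical
    ]


def toy_temporal_field(center: Vec3, t: float) -> Vec3:
    """Stand-in for a learned deformation: points higher up drift further along x."""
    return (0.1 * t * center[1], 0.0, 0.0)


if __name__ == "__main__":
    canonical_scene = [
        Gaussian((0.0, 1.0, 0.0), (1.0, 1.0, 1.0), 0.5),
        Gaussian((0.0, 2.0, 0.0), (1.0, 1.0, 1.0), 0.8),
    ]
    for t in (0.0, 1.0, 2.0):
        centers = [g.center for g in deform(canonical_scene, toy_temporal_field, t)]
        print(t, centers)
```

Keeping the canonical Gaussians fixed and applying deformation as a separate function mirrors the two-stage description in the abstract: the static structure is recovered once, and motion is layered on top of it. The per-frame deformation used to absorb inconsistencies in the freeze-time video could be modeled with the same `deform` call, indexed by frame instead of time.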