Abstract: Existing dynamic scene generation methods mostly rely on distilling knowledge from pre-trained 3D generative models, which are typically fine-tuned on synthetic object datasets. As a result, the generated scenes are often object-centric and lack photorealism. To address these limitations, we introduce a novel pipeline designed for photorealistic text-to-4D scene generation, discarding the dependency on multi-view generative models and instead fully utilizing video generative models trained on diverse real-world datasets. Our method begins by generating a reference video using the video generation model. We then learn the canonical 3D representation of the video using a freeze-time video, delicately generated from the reference video. To handle inconsistencies in the freeze-time video, we jointly learn a per-frame deformation to model these imperfections. We then learn the temporal deformation based on the canonical representation to capture dynamic interactions in the reference video. The pipeline facilitates the generation of dynamic scenes with enhanced photorealism and structural integrity, viewable from multiple perspectives, thereby setting a new standard in 4D scene generation.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address several key issues in existing dynamic scene generation methods:

1. **Object-Centricity**: Because 4D data is scarce, existing 4D generation pipelines rely on image, multi-view, and video generation models as priors to synthesize 4D samples. However, the multi-view models are fine-tuned on static, synthetic 3D assets, so the generated 4D results are primarily object-centric, lack realism, and capture little of the complex interaction between objects and their environment.
2. **Lack of Realism**: The scenes generated by existing methods often lack realism, especially when they involve dynamic objects and complex interactions.
3. **Dependence on Specific Datasets**: Existing methods typically rely on specific datasets for fine-tuning, which limits their diversity and generalization.

To address these issues, the paper proposes a new pipeline, 4Real, for generating photorealistic text-to-4D scenes. 4Real addresses these problems in the following ways:

- **No Dependence on Multi-View Generation Models**: 4Real discards the reliance on multi-view generation models and instead leverages video generation models trained on large-scale real-world videos, which cover more diverse and general appearances, shapes, motions, and interactions between objects and the environment.
- **Improved Generation Quality and Diversity**: 4Real supports more use cases, generates more diverse results, and requires fewer computational resources.
- **Using Deformable 3D Gaussian Splatting (D-3DGS) to Represent Dynamic Scenes**: 4Real employs D-3DGS as its dynamic scene representation. It generates a reference video and a freeze-time video, then reconstructs a canonical 3D representation and a temporal deformation from them, yielding dynamic scenes with realism and structural integrity.

Through these improvements, 4Real can generate dynamic scenes of near-photorealistic quality that are viewable from different viewpoints and at different time points, setting a new standard in the field of 4D scene generation.
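
To make the roles of the canonical representation and the two deformations more concrete, here is a minimal Python sketch. It is not the paper's implementation: the class `CanonicalScene`, the function `apply_deformation`, and both closed-form offset functions are hypothetical stand-ins. In the actual method the deformations are learned, and each Gaussian also carries covariance, color, and opacity, which this toy omits.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class CanonicalScene:
    """Toy canonical representation: Gaussian centers only."""
    centers: List[Vec3]


def apply_deformation(scene: CanonicalScene,
                      offset_fn: Callable[[Vec3, float], Vec3],
                      key: float) -> List[Vec3]:
    """Shift every canonical center by the offset predicted for `key`."""
    moved = []
    for c in scene.centers:
        d = offset_fn(c, key)
        moved.append((c[0] + d[0], c[1] + d[1], c[2] + d[2]))
    return moved


def per_frame_offset(center: Vec3, frame: float) -> Vec3:
    """Assumed stand-in for the per-frame deformation that absorbs
    inconsistencies between freeze-time frames: a small drift along z."""
    return (0.0, 0.0, 0.01 * frame)


def temporal_offset(center: Vec3, t: float) -> Vec3:
    """Assumed stand-in for the temporal deformation: a sway along x,
    scaled by height (y), with t in [0, 1]."""
    return (0.1 * center[1] * math.sin(2.0 * math.pi * t), 0.0, 0.0)


scene = CanonicalScene(centers=[(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)])

# Stage 1 (freeze-time video): one shared canonical scene, with a
# per-frame correction applied for frame index 3.
freeze_time_frame_3 = apply_deformation(scene, per_frame_offset, 3)

# Stage 2 (reference video): the same canonical scene, deformed at t = 0.25.
dynamic_quarter = apply_deformation(scene, temporal_offset, 0.25)
```

The point of the sketch is that both stages share one canonical scene. The per-frame offsets only explain away imperfections in the freeze-time video, while the temporal offsets carry the motion seen in the reference video.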

4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion

PaintScene4D: Consistent 4D Scene Generation from Text Prompts

Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models

EG4D: Explicit Generation of 4D Object without Score Distillation

4Dynamic: Text-to-4D Generation with Hybrid Priors

4Diffusion: Multi-view Video Diffusion Model for 4D Generation

Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of Video and Multi-view Diffusion Models

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion

4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency

AR4D: Autoregressive 4D Generation from Monocular Videos

SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency

Efficient4D: Fast Dynamic 3D Object Generation from a Single-view Video

SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer

Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting

DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos

GenXD: Generating Any 3D and 4D Scenes

Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian Surfels

Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models

V3D: Video Diffusion Models are Effective 3D Generators