Abstract: Recent advancements in generative models have ignited substantial interest in dynamic 3D content creation (i.e., 4D generation). Existing approaches primarily rely on Score Distillation Sampling (SDS) to infer novel-view videos, typically leading to issues such as limited diversity, spatial-temporal inconsistency and poor prompt alignment, due to the inherent randomness of SDS. To tackle these problems, we propose AR4D, a novel paradigm for SDS-free 4D generation. Specifically, our paradigm consists of three stages. To begin with, for a monocular video that is either generated or captured, we first utilize pre-trained expert models to create a 3D representation of the first frame, which is further fine-tuned to serve as the canonical space. Subsequently, motivated by the fact that videos happen naturally in an autoregressive manner, we propose to generate each frame's 3D representation based on its previous frame's representation, as this autoregressive generation manner can facilitate more accurate geometry and motion estimation. Meanwhile, to prevent overfitting during this process, we introduce a progressive view sampling strategy, utilizing priors from pre-trained large-scale 3D reconstruction models. To avoid appearance drift introduced by autoregressive generation, we further incorporate a refinement stage based on a global deformation field and the geometry of each frame's 3D representation. Extensive experiments have demonstrated that AR4D can achieve state-of-the-art 4D generation without SDS, delivering greater diversity, improved spatial-temporal consistency, and better alignment with input prompts.

What problem does this paper attempt to address?

The paper addresses shortcomings of existing 4D generation methods based on Score Distillation Sampling (SDS): limited diversity, spatio-temporal inconsistency, and poor alignment with input prompts. These problems lead to low-quality 4D object generation, especially when the input is a monocular video. Specifically:

1. **Limited diversity**: Content generated with SDS often lacks diversity.
2. **Spatio-temporal inconsistency**: The generated 4D content may be inconsistent across space and time, which reduces overall quality and realism.
3. **Poor alignment with input prompts**: The generated content may not match the input prompts or reference videos.

To solve these problems, the authors propose AR4D (Autoregressive 4D Generation from Monocular Videos), a new paradigm that does not rely on SDS and aims to generate high-quality 4D content from monocular videos. AR4D works in three stages: initialization, generation, and refinement (optimization).

### Main contributions

1. **Proposing AR4D**: A new paradigm that generates high-quality 4D assets from monocular videos and avoids the limitations of SDS.
2. **Autoregressive generation**: A local deformation field autoregressively generates the 3D representation of each frame from that of the previous frame. A progressive view sampling strategy further improves the result, giving more accurate geometry and motion estimation.
3. **Refinement (optimization) stage**: To reduce the cumulative error of autoregressive generation, a refinement stage based on a global deformation field and the geometry extracted from each frame's 3D representation enforces spatio-temporal consistency in the generated 4D content.
4. **Experimental verification**: Extensive experiments show that AR4D achieves state-of-the-art performance without relying on SDS, with higher diversity, better spatio-temporal consistency, and better alignment with input prompts.
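The autoregressive generation idea in the contributions can be illustrated with a toy sketch. It is not the authors' implementation: `LocalField` is a hypothetical stand-in for the per-frame MLP deformation field, the 3D representation is reduced to a list of Gaussian centers, and training of the fields is omitted entirely. The sketch only shows the chaining structure, in which frame t is derived from frame t-1 rather than directly from the first frame.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]
# Hypothetical stand-in for a per-frame MLP: maps a Gaussian center to an offset.
LocalField = Callable[[Point], Point]


def apply_field(points: List[Point], field: LocalField) -> List[Point]:
    """Move every center by the offset the field predicts for it."""
    moved = []
    for p in points:
        d = field(p)
        moved.append((p[0] + d[0], p[1] + d[1], p[2] + d[2]))
    return moved


def generate_autoregressively(first_frame: List[Point],
                              fields: List[LocalField]) -> List[List[Point]]:
    """Build frame t from frame t-1 only, never directly from frame 0."""
    frames = [first_frame]
    for field in fields:
        frames.append(apply_field(frames[-1], field))
    return frames


def shift_x(p: Point) -> Point:
    """Toy deformation (assumption): a constant shift of 0.5 along x."""
    return (0.5, 0.0, 0.0)


# Toy usage: two centers, three autoregressive steps.
frames = generate_autoregressively([(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)], [shift_x] * 3)
print(frames[-1])  # [(1.5, 0.0, 0.0), (2.5, 2.0, 3.0)]
```

Because each frame depends on the previous one, any error in one step is carried into all later frames, which is the cumulative error that the third contribution is meant to reduce.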
### Method overview

#### Initialization stage

- Use a pre-trained multi-view diffusion model to generate multiple novel views of the first frame, then use a large-scale 3D reconstruction model to recover the corresponding 3D representation (i.e., 3D Gaussians).
- Fine-tune the resulting 3D Gaussians to better capture the fine-grained texture details of the reference frame.

#### Generation stage

- Exploit the autoregressive nature of video: assume that the 3D Gaussians of the current frame are mainly determined by those of the previous frame, and model the deformation between each pair of adjacent frames with a separate MLP-based local deformation field.
- Introduce a progressive view sampling strategy that uses a pre-trained large-scale 3D reconstruction model to generate pseudo novel views as additional supervision, which prevents overfitting.

#### Refinement (optimization) stage

- Based on the observation that geometry is relatively stable, take the 3D Gaussians of the first frame as the canonical space and construct a global deformation field. This keeps the geometric deformation of each frame within a controllable range and reduces appearance drift.

Together, these components improve the quality and robustness of 4D generation and address the key problems of existing SDS-based methods.
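The role of the global deformation field in the optimization (refinement) stage can be sketched as follows. This is an illustrative assumption rather than the paper's formulation: `GlobalField` is a hypothetical callable standing in for a learned network, time is normalized to [0, 1], and the mean squared distance used in `geometry_loss` is an assumed form of the geometric supervision. The summary does not state the exact loss.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]
# Hypothetical stand-in for the global field: (canonical center, time) -> offset.
GlobalField = Callable[[Point, float], Point]


def deform(canonical: List[Point], field: GlobalField, t: float) -> List[Point]:
    """Warp the canonical (first-frame) centers to time t in a single step."""
    out = []
    for p in canonical:
        d = field(p, t)
        out.append((p[0] + d[0], p[1] + d[1], p[2] + d[2]))
    return out


def geometry_loss(canonical: List[Point], field: GlobalField,
                  per_frame_geometry: List[List[Point]]) -> float:
    """Mean squared distance between warped canonical centers and the
    per-frame centers from the generation stage (assumed loss form)."""
    total, count = 0.0, 0
    last = max(len(per_frame_geometry) - 1, 1)
    for i, target in enumerate(per_frame_geometry):
        warped = deform(canonical, field, i / last)  # normalized time in [0, 1]
        for w, g in zip(warped, target):
            total += sum((a - b) ** 2 for a, b in zip(w, g))
            count += 1
    return total / count


# Toy usage: one center that moves 1 unit along x per frame.
canonical = [(0.0, 0.0, 0.0)]
geometry = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]


def linear_field(p: Point, t: float) -> Point:
    return (2.0 * t, 0.0, 0.0)


def static_field(p: Point, t: float) -> Point:
    return (0.0, 0.0, 0.0)


print(geometry_loss(canonical, linear_field, geometry))  # 0.0
print(geometry_loss(canonical, static_field, geometry) > 0.0)  # True
```

Unlike a chain of local fields, every frame here is reached directly from the canonical space, so appearance is tied to the first frame while the per-frame geometry only guides where the Gaussians move.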

AR4D: Autoregressive 4D Generation from Monocular Videos
