AR4D: Autoregressive 4D Generation from Monocular Videos

Hanxin Zhu, Tianyu He, Xiqian Yu, Junliang Guo, Zhibo Chen, Jiang Bian
2025-01-03
Abstract: Recent advancements in generative models have ignited substantial interest in dynamic 3D content creation (i.e., 4D generation). Existing approaches primarily rely on Score Distillation Sampling (SDS) to infer novel-view videos, typically leading to issues such as limited diversity, spatial-temporal inconsistency and poor prompt alignment, due to the inherent randomness of SDS. To tackle these problems, we propose AR4D, a novel paradigm for SDS-free 4D generation. Specifically, our paradigm consists of three stages. To begin with, for a monocular video that is either generated or captured, we first utilize pre-trained expert models to create a 3D representation of the first frame, which is further fine-tuned to serve as the canonical space. Subsequently, motivated by the fact that videos happen naturally in an autoregressive manner, we propose to generate each frame's 3D representation based on its previous frame's representation, as this autoregressive generation manner can facilitate more accurate geometry and motion estimation. Meanwhile, to prevent overfitting during this process, we introduce a progressive view sampling strategy, utilizing priors from pre-trained large-scale 3D reconstruction models. To avoid appearance drift introduced by autoregressive generation, we further incorporate a refinement stage based on a global deformation field and the geometry of each frame's 3D representation. Extensive experiments have demonstrated that AR4D can achieve state-of-the-art 4D generation without SDS, delivering greater diversity, improved spatial-temporal consistency, and better alignment with input prompts.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper targets weaknesses of existing 4D generation methods built on Score Distillation Sampling (SDS): limited diversity, spatio-temporal inconsistency, and poor alignment with input prompts. These weaknesses lower the quality of generated 4D objects, especially when the input is a monocular video. Specifically:

1. **Limited diversity**: Content generated with SDS often lacks diversity.
2. **Spatio-temporal inconsistency**: The generated 4D content may be inconsistent across space and time, which hurts overall quality and realism.
3. **Poor alignment with input prompts**: The generated content may not match the input prompts or reference videos.

To solve these problems, the authors propose AR4D (Autoregressive 4D Generation from Monocular Videos), a new paradigm that does not rely on SDS and aims to generate high-quality 4D content from monocular videos. AR4D works in three stages: initialization, generation, and optimization.

### Main contributions

1. **Proposing AR4D**: A new paradigm that generates high-quality 4D assets from monocular videos while avoiding the limitations of SDS.
2. **Autoregressive generation**: A local deformation field autoregressively generates the 3D representation of each frame from that of the previous frame. A progressive view sampling strategy further improves this step, giving more accurate geometry and motion estimation.
3. **Optimization stage**: To limit accumulated error, an optimization stage based on a global deformation field and the geometry extracted from each frame's 3D representation keeps the generated 4D content spatio-temporally consistent.
4. **Experimental verification**: Extensive experiments show that AR4D achieves state-of-the-art performance without relying on SDS, with higher diversity, better spatio-temporal consistency, and better alignment with input prompts.
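As a rough illustration of the autoregressive idea in contribution 2, the sketch below rolls a point set forward frame by frame, where each frame is derived only from the previous one. It is a toy example, not the authors' implementation: the names `Deform`, `apply_deform`, `autoregressive_rollout`, and `make_shift` are hypothetical, and a plain Python function stands in for the per-frame local deformation field that the paper models with a learned network.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]
# Hypothetical stand-in for a learned local deformation field:
# it maps a 3D position to the offset that moves it to the next frame.
Deform = Callable[[Point], Point]


def apply_deform(points: List[Point], deform: Deform) -> List[Point]:
    """Move every point by the offset the local field predicts for it."""
    moved = []
    for p in points:
        dx, dy, dz = deform(p)
        moved.append((p[0] + dx, p[1] + dy, p[2] + dz))
    return moved


def autoregressive_rollout(
    first_frame: List[Point], deforms: List[Deform]
) -> List[List[Point]]:
    """Build frame t from frame t-1 only, never directly from frame 0."""
    frames = [first_frame]
    for deform in deforms:
        frames.append(apply_deform(frames[-1], deform))
    return frames


def make_shift(dx: float) -> Deform:
    """Toy deformation (assumption): translate every point along x by dx."""
    return lambda p: (dx, 0.0, 0.0)


# Two points stand in for the first frame's 3D representation.
canonical = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
frames = autoregressive_rollout(canonical, [make_shift(0.5), make_shift(0.25)])
```

Because each step builds on the previous result, small per-step errors can accumulate along the chain, which is the drift that the later optimization stage is meant to limit.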
### Method overview

#### Initialization stage

- Use a pre-trained multi-view diffusion model to generate multiple novel views of the first frame, then use a large-scale 3D reconstruction model to recover the corresponding 3D representation (i.e., 3D Gaussians).
- Fine-tune the resulting 3D Gaussians to better capture the fine-grained texture details of the reference frame.

#### Generation stage

- Exploit the autoregressive property: assume that the 3D Gaussians of the current frame depend mainly on those of the previous frame, and model the deformation between adjacent frames with a separate MLP-based local deformation field.
- Introduce a progressive view sampling strategy that uses a pre-trained large-scale 3D reconstruction model to generate pseudo novel views as additional supervision, which prevents overfitting.

#### Optimization stage

- Observing that geometry remains relatively stable, take the 3D Gaussians of the first frame as the canonical space and construct a global deformation field. This keeps each frame's geometric deformation within a controllable range and reduces appearance drift.

Together, these stages are intended to improve the quality and robustness of 4D generation and to address the key problems of existing SDS-based methods.
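The summary does not spell out how the progressive view sampling in the generation stage is scheduled. The sketch below shows one plausible reading, stated purely as an assumption: the range of camera azimuths used for pseudo novel views widens linearly as optimization proceeds. The function names, the linear schedule, and the 180-degree limit are all hypothetical, and rendering the pseudo views with a reconstruction model is left out.

```python
import random
from typing import List


def max_azimuth_deg(step: int, total_steps: int, full_range_deg: float = 180.0) -> float:
    """Assumed schedule: widen the allowed azimuth offset linearly from 0 to full_range_deg."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return full_range_deg * progress


def sample_azimuths(
    step: int, total_steps: int, n_views: int, rng: random.Random
) -> List[float]:
    """Draw azimuths (degrees, relative to the reference camera) for pseudo novel views."""
    limit = max_azimuth_deg(step, total_steps)
    return [rng.uniform(-limit, limit) for _ in range(n_views)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
early = sample_azimuths(step=100, total_steps=1000, n_views=4, rng=rng)
late = sample_azimuths(step=900, total_steps=1000, n_views=4, rng=rng)
```

Under this assumed schedule, early steps supervise only views close to the reference camera, and views far from it are introduced gradually.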