Abstract: In this paper, we present Consistent4D, a novel approach for generating 4D dynamic objects from uncalibrated monocular videos. Uniquely, we cast the 360-degree dynamic object reconstruction as a 4D generation problem, eliminating the need for tedious multi-view data collection and camera calibration. This is achieved by leveraging the object-level 3D-aware image diffusion model as the primary supervision signal for training Dynamic Neural Radiance Fields (DyNeRF). Specifically, we propose a Cascade DyNeRF to facilitate stable convergence and temporal continuity under the supervision signal which is discrete along the time axis. To achieve spatial and temporal consistency, we further introduce an Interpolation-driven Consistency Loss. It is optimized by minimizing the discrepancy between rendered frames from DyNeRF and interpolated frames from a pre-trained video interpolation model. Extensive experiments show that our Consistent4D can perform competitively to prior art alternatives, opening up new possibilities for 4D dynamic object generation from monocular videos, whilst also demonstrating advantage for conventional text-to-3D generation tasks. Our project page is https://consistent4d.github.io/.

What problem does this paper attempt to address?

The paper addresses the problem of generating consistent 360-degree dynamic objects (4D objects) from monocular videos. The authors propose Consistent4D, a method that generates 4D dynamic objects from uncalibrated monocular videos without multi-view data acquisition or camera calibration.

### Main Challenges

1. **Spatial and Temporal Consistency**: The generated 4D dynamic objects must stay consistent across both viewpoints and time.
2. **Lack of Multi-view Information**: A monocular video shows the object from a single viewpoint, so traditional multi-view reconstruction methods are difficult to apply.
3. **High-Quality Rendering**: The generated 4D dynamic objects need clear details and natural motion.

### Solutions

1. **Cascade Dynamic Neural Radiance Field (Cascade DyNeRF)**: A cascade structure lets DyNeRF converge stably and remain temporally continuous when trained with a pre-trained object-level 3D-aware image diffusion model as the main supervision signal, even though that signal is discrete along the time axis.
2. **Interpolation-driven Consistency Loss (ICL)**: ICL improves spatial and temporal consistency by minimizing the difference between frames rendered from DyNeRF and interpolated frames from a pre-trained video interpolation model.
3. **Lightweight Video Enhancer**: A lightweight video enhancer is trained to further improve the quality of videos rendered from DyNeRF, reducing artifacts such as blurry edges and floating objects.

### Experimental Results

- **Quantitative Evaluation**: Experiments on multiple synthetic and real-world videos show that Consistent4D outperforms other methods in terms of LPIPS and CLIP scores.
- **Qualitative Evaluation**: The generated 4D dynamic objects look better in novel views than those from D-NeRF and K-planes, especially in the absence of multi-view information.

### Contributions

1. Proposing a new framework for generating 4D dynamic objects from statically captured monocular videos, with a Cascade DyNeRF designed to represent dynamic objects and optimized with a Score Distillation Sampling (SDS) loss.
2. Introducing the Interpolation-driven Consistency Loss (ICL), which significantly improves spatial and temporal consistency in 4D generation tasks.
3. Further enhancing the quality of generated videos through a lightweight video enhancer, demonstrating potential in video-to-4D generation tasks.

### Conclusion

The proposed method makes significant progress in generating 4D dynamic objects from monocular videos. It addresses spatial and temporal consistency while improving the quality of generated objects, and it opens new possibilities for future research in dynamic scene generation.
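
The Interpolation-driven Consistency Loss described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names are hypothetical, a plain linear blend stands in for the pre-trained video interpolation model, and mean squared error is an assumed choice for the "discrepancy" between rendered and interpolated frames.

```python
import numpy as np


def linear_interpolator(first, last, num_middle):
    """Hypothetical stand-in for a pre-trained video interpolation model.

    Returns `num_middle` frames blended linearly between `first` and `last`.
    A real system would call a learned interpolation network here.
    """
    weights = np.linspace(0.0, 1.0, num_middle + 2)[1:-1]
    return np.stack([(1.0 - w) * first + w * last for w in weights])


def interpolation_consistency_loss(rendered, interpolator=linear_interpolator):
    """Illustrative ICL-style loss (assumption: mean squared error).

    `rendered` has shape (N, H, W, C) with N >= 3 and holds frames rendered
    along one axis, either consecutive timestamps or neighbouring views.
    The first and last frames are interpolated, and the rendered middle
    frames are compared against the interpolated ones.
    """
    rendered = np.asarray(rendered, dtype=np.float64)
    if rendered.shape[0] < 3:
        raise ValueError("need at least three rendered frames")
    targets = interpolator(rendered[0], rendered[-1], rendered.shape[0] - 2)
    return float(np.mean((rendered[1:-1] - targets) ** 2))
```

Frames that already vary smoothly between the two endpoints give a loss near zero, while a middle frame that flickers or jumps is penalized. Minimizing this penalty is what pushes the rendered frames toward consistency along the chosen axis.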

Consistent4D: Consistent 360° Dynamic Object Generation from Monocular Video

Efficient4D: Fast Dynamic 3D Object Generation from a Single-view Video

EG4D: Explicit Generation of 4D Object without Score Distillation

SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency

4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency

CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

Deblur4DGS: 4D Gaussian Splatting from Blurry Monocular Video

Human4DiT: 360-degree Human Video Generation with 4D Diffusion Transformer

DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos

4Diffusion: Multi-view Video Diffusion Model for 4D Generation

DreamMesh4D: Video-to-4D Generation with Sparse-Controlled Gaussian-Mesh Hybrid Representation

Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of Video and Multi-view Diffusion Models

Self-Calibrating 4D Novel View Synthesis from Monocular Videos Using Gaussian Splatting

SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer

PaintScene4D: Consistent 4D Scene Generation from Text Prompts

4DRecons: 4D Neural Implicit Deformable Objects Reconstruction from a single RGB-D Camera with Geometrical and Topological Regularizations

Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image

Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models

4Dynamic: Text-to-4D Generation with Hybrid Priors

DRSM: efficient neural 4d decomposition for dynamic reconstruction in stationary monocular cameras

4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models