Efficient4D: Fast Dynamic 3D Object Generation from a Single-view Video

Zijie Pan, Zeyu Yang, Xiatian Zhu, Li Zhang

2024-07-22

Abstract: Generating dynamic 3D objects from a single-view video is challenging due to the lack of 4D labeled data. An intuitive approach is to extend previous image-to-3D pipelines by transferring off-the-shelf image generation models such as score distillation sampling. However, this approach would be slow and expensive to scale due to the need for back-propagating the information-limited supervision signals through a large pretrained model. To address this, we propose an efficient video-to-4D object generation framework called Efficient4D. It generates high-quality spacetime-consistent images under different camera views, and then uses them as labeled data to directly reconstruct the 4D content through a 4D Gaussian splatting model. Importantly, our method can achieve real-time rendering under continuous camera trajectories. To enable robust reconstruction under sparse views, we introduce an inconsistency-aware confidence-weighted loss design, along with a lightly weighted score distillation loss. Extensive experiments on both synthetic and real videos show that Efficient4D offers a remarkable 10-fold increase in speed when compared to prior art alternatives while preserving the quality of novel view synthesis. For example, Efficient4D takes only 10 minutes to model a dynamic object, vs 120 minutes by the previous art model Consistent4D.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### The Problem Addressed by the Paper

This paper aims to tackle the challenge of generating dynamic 3D objects (i.e., 4D objects) from single-view videos. Specifically, existing methods face the following issues when generating high-quality, spatiotemporally consistent 4D objects:

1. **Lack of 4D annotated data**: Generating dynamic 3D objects requires a large amount of 4D annotated data, which is difficult to obtain.
2. **Low computational efficiency**: Existing methods typically rely on large pre-trained models and require backpropagation to pass supervision signals, leading to low computational efficiency and poor scalability.
3. **Poor spatiotemporal consistency**: The generated images often exhibit inconsistencies across different viewpoints and time points, affecting the quality of the final 4D objects.

To overcome these issues, the authors propose an efficient video-to-4D object generation framework called Efficient4D. This framework can generate high-quality, spatiotemporally consistent images and directly use these images to reconstruct 4D content while achieving real-time rendering. Additionally, Efficient4D can robustly reconstruct under sparse viewpoints and significantly improve generation speed compared to existing methods. For example, Efficient4D can generate a dynamic object in only 10 minutes, whereas the previous state-of-the-art method, Consistent4D, requires 120 minutes.
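The abstract mentions an inconsistency-aware confidence-weighted loss combined with a lightly weighted score distillation loss, but this page does not give the formulation. The sketch below only illustrates the general idea under stated assumptions: a per-pixel confidence map in [0, 1] down-weights pixels where the generated views are believed to be inconsistent, and a small scalar weight scales an externally computed distillation term. The function name, the L1 error, the normalization, and the default `sds_weight` are all hypothetical choices, not the authors' method.

```python
import numpy as np


def confidence_weighted_loss(rendered, target, confidence,
                             sds_term=0.0, sds_weight=0.01):
    """Illustrative confidence-weighted reconstruction loss (assumed form).

    rendered, target: arrays of shape (H, W, C) with pixel values.
    confidence: array of shape (H, W); higher means the generated
        target pixel is trusted more (hypothetical definition).
    sds_term: a precomputed scalar standing in for a score distillation
        loss; computing it for real requires a diffusion model.
    sds_weight: small weight, reflecting "lightly weighted" in the abstract.
    """
    rendered = np.asarray(rendered, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    confidence = np.asarray(confidence, dtype=np.float64)

    # Per-pixel L1 error averaged over channels, shape (H, W).
    per_pixel = np.abs(rendered - target).mean(axis=-1)

    # Confidence-weighted mean; the floor avoids division by zero
    # when every pixel has zero confidence.
    weighted = (confidence * per_pixel).sum() / max(confidence.sum(), 1e-8)

    return weighted + sds_weight * sds_term


# One wrong pixel in a 2x2 image.
target = np.zeros((2, 2, 3))
rendered = np.zeros((2, 2, 3))
rendered[0, 0] = 1.0

uniform = np.ones((2, 2))
masked = np.ones((2, 2))
masked[0, 0] = 0.0  # distrust the pixel where the views disagree

loss_uniform = confidence_weighted_loss(rendered, target, uniform)
loss_masked = confidence_weighted_loss(rendered, target, masked)
loss_with_sds = confidence_weighted_loss(rendered, target, uniform, sds_term=2.0)
```

With uniform confidence the single wrong pixel contributes to the loss; setting its confidence to zero removes that contribution, which is the behavior one would want when a generated view is unreliable at that location.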
