Abstract: Recent advancements in 3D generation are predominantly propelled by improvements in 3D-aware image diffusion models. These models are pretrained on Internet-scale image data and fine-tuned on massive 3D data, which enables them to produce highly consistent multi-view images. However, because synchronized multi-view video data is scarce, it remains challenging to adapt this paradigm directly to 4D generation. Nevertheless, the available video and 3D data are adequate for separately training video and multi-view diffusion models, which provide satisfactory dynamic and geometric priors, respectively. To take advantage of both, this paper presents Diffusion$^2$, a novel framework for dynamic 3D content creation that reconciles the knowledge of geometric consistency and temporal smoothness from these models to directly sample dense multi-view, multi-frame images, which can then be used to optimize a continuous 4D representation. Specifically, we design a simple yet effective denoising strategy via score composition of pretrained video and multi-view diffusion models, based on the probability structure of the target image array. To alleviate potential conflicts between the two heterogeneous scores, we further introduce variance-reducing sampling via interpolated steps, facilitating smooth and stable generation. Owing to the high parallelism of the proposed image generation process and the efficiency of the modern 4D reconstruction pipeline, our framework can generate 4D content within a few minutes. Notably, our method circumvents the reliance on expensive and hard-to-scale 4D data and can therefore benefit from the scaling of foundation video and multi-view diffusion models. Extensive experiments demonstrate the efficacy of the proposed framework in generating highly seamless and consistent 4D assets under various types of conditions.
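
As a rough illustration of the score-composition idea described in the abstract, the toy sketch below treats the target as an image array indexed by view and frame, and combines one score acting across views with another acting across frames. Everything in it is an assumption made for illustration: the Gaussian smoothness prior standing in for both pretrained models, the convex weight `w`, the plain gradient-ascent update, and all function names. It is not the paper's composition rule, its pretrained networks, or its variance-reducing sampler.

```python
import numpy as np


def smoothness_score(x, axis):
    """Toy score: gradient of log p(x) = -0.5 * sum of squared
    differences between neighbours along `axis` (a stand-in prior)."""
    d = np.diff(x, axis=axis)
    pad = [(0, 0)] * x.ndim
    pad[axis] = (1, 0)
    left = np.pad(d, pad)    # difference to the previous neighbour
    pad[axis] = (0, 1)
    right = np.pad(d, pad)   # difference to the next neighbour
    return right - left


def composed_score(x, w=0.5):
    """Combine two heterogeneous scores on an array of shape
    (views, frames, pixels). The convex weight `w` is hypothetical."""
    s_multiview = smoothness_score(x, axis=0)  # stand-in: multi-view model, per frame
    s_video = smoothness_score(x, axis=1)      # stand-in: video model, per view
    return w * s_multiview + (1.0 - w) * s_video


def roughness(x, axis):
    """Mean squared difference between neighbours along `axis`."""
    return float(np.mean(np.diff(x, axis=axis) ** 2))


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 8))  # 4 views, 6 frames, 8 "pixels"
before = (roughness(x, 0), roughness(x, 1))

# Follow the composed score with small steps (noise-free, for clarity).
for _ in range(50):
    x = x + 0.1 * composed_score(x)

after = (roughness(x, 0), roughness(x, 1))
print("view/frame roughness before:", before)
print("view/frame roughness after: ", after)
```

In this toy setting, following the composed score reduces variation along both the view axis and the frame axis at once, which mirrors the stated goal of reconciling geometric consistency with temporal smoothness in a single sampling process.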

Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of Video and Multi-view Diffusion Models