Controlling Space and Time with Diffusion Models

Daniel Watson,Saurabh Saxena,Lala Li,Andrea Tagliasacchi,David J. Fleet

2024-07-11

Abstract: We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS), conditioned on one or more images of a general scene, and a set of camera poses and timestamps. To overcome challenges due to limited availability of 4D training data, we advocate joint training on 3D (with camera pose), 4D (pose+time) and video (time but no pose) data and propose a new architecture that enables the same. We further advocate the calibration of SfM posed data using monocular metric depth estimators for metric scale camera control. For model evaluation, we introduce new metrics to enrich and overcome shortcomings of current evaluation schemes, demonstrating state-of-the-art results in both fidelity and pose control compared to existing diffusion models for 3D NVS, while at the same time adding the ability to handle temporal dynamics. 4DiM is also used for improved panorama stitching, pose-conditioned video to video translation, and several other tasks. For an overview see https://4d-diffusion.github.io

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses 4D novel view synthesis (NVS) with diffusion models, in particular for general scenes that contain dynamic elements. Existing models often focus on objects with limited camera poses and static backgrounds. The proposed approach, 4DiM, handles full scenes, free-form camera poses, and temporal control. 4DiM is a cascaded diffusion model that generates novel views conditioned on one or more images of a scene together with a set of camera poses and timestamps.

Because 4D training data is scarce, the authors propose joint training on 3D data (with camera poses), 4D data (pose + time), and video data (time but no pose), and design a new architecture that makes this possible. They also calibrate SfM-posed data using monocular metric depth estimators, so that camera control operates at metric scale. For evaluation, they introduce new metrics that address shortcomings of existing evaluation schemes, and they report state-of-the-art results in fidelity and pose control compared to existing diffusion models for 3D NVS, along with the added ability to handle temporal dynamics. Beyond sample generation, 4DiM is also applied to pose-conditioned video-to-video translation, improved panorama stitching, and other tasks.

The main contributions include: 1. Extending from objects to scenes; 2. Allowing free-form camera pose control; 3. Achieving simultaneous spatial and temporal control by conditioning on camera poses and timestamps. In the experimental section, 4DiM outperforms existing diffusion models on several benchmarks, especially in 3D consistency, pose alignment, and capturing the motion of dynamic content. The paper also discusses the importance of joint training with video data and the effect of calibrated data on model performance.
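The calibration step mentioned above can be illustrated with a small sketch. SfM reconstructions are only defined up to an unknown global scale, so one plausible way to bring them to metric scale is to compare SfM depths with depths predicted by a monocular metric depth estimator and rescale the camera translations by a robust ratio. The code below is a minimal illustration under that assumption, not the paper's actual procedure: the function names `estimate_metric_scale` and `rescale_camera_translations`, the median-of-ratios estimator, and the 4x4 camera-to-world pose layout are all hypothetical choices made for this example.

```python
import numpy as np


def estimate_metric_scale(sfm_depth, metric_depth, valid=None):
    """Return a scalar s such that s * sfm_depth roughly matches metric_depth.

    Assumption (not from the paper): the scale is taken as the median of
    per-pixel depth ratios, which is robust to outliers in either source.
    """
    sfm_depth = np.asarray(sfm_depth, dtype=np.float64)
    metric_depth = np.asarray(metric_depth, dtype=np.float64)

    # Keep only pixels where both depths are positive and finite.
    mask = (
        (sfm_depth > 0)
        & (metric_depth > 0)
        & np.isfinite(sfm_depth)
        & np.isfinite(metric_depth)
    )
    if valid is not None:
        mask &= np.asarray(valid, dtype=bool)
    if not mask.any():
        raise ValueError("no valid depth pairs to estimate a scale from")

    return float(np.median(metric_depth[mask] / sfm_depth[mask]))


def rescale_camera_translations(cam_to_world, scale):
    """Scale the translation part of 4x4 camera-to-world matrices.

    Rotations are left untouched; only camera positions change, so the
    scene geometry is uniformly stretched to metric units.
    """
    poses = np.array(cam_to_world, dtype=np.float64, copy=True)
    poses[..., :3, 3] *= scale
    return poses
```

In use, one would estimate a single scale per scene (for instance by pooling depth ratios over several frames) and apply it to every camera pose of that scene, so that a unit of camera translation corresponds to roughly one meter.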

CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

4Diffusion: Multi-view Video Diffusion Model for 4D Generation

SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency

Animate3D: Animating Any 3D Model with Multi-view Video Diffusion

Self-Calibrating 4D Novel View Synthesis from Monocular Videos Using Gaussian Splatting

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion

Human4DiT: 360-degree Human Video Generation with 4D Diffusion Transformer

Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion

SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of Video and Multi-view Diffusion Models

Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views

Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation

Scalable Diffusion Models with State Space Backbone

S4D: Streaming 4D Real-World Reconstruction with Gaussians and 3D Control Points

Generic 3D Diffusion Adapter Using Controlled Multi-View Editing

MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction

VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data

3D-free meets 3D priors: Novel View Synthesis from a Single Image with Pretrained Diffusion Guidance