Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, David J. Fleet
Abstract: We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS), conditioned on one or more images of a general scene, and a set of camera poses and timestamps. To overcome challenges due to limited availability of 4D training data, we advocate joint training on 3D (with camera pose), 4D (pose+time) and video (time but no pose) data and propose a new architecture that enables the same. We further advocate the calibration of SfM posed data using monocular metric depth estimators for metric scale camera control. For model evaluation, we introduce new metrics to enrich and overcome shortcomings of current evaluation schemes, demonstrating state-of-the-art results in both fidelity and pose control compared to existing diffusion models for 3D NVS, while at the same time adding the ability to handle temporal dynamics. 4DiM is also used for improved panorama stitching, pose-conditioned video to video translation, and several other tasks. For an overview see https://4d-diffusion.github.io
What problem does this paper attempt to address?
The paper addresses 4D novel view synthesis with diffusion models, particularly for general scenes that contain dynamic elements. Existing models often focus on objects, with limited camera poses and static backgrounds. The proposed model, 4DiM, instead handles general scenes, free-form camera poses, and temporal control.
4DiM is a cascaded diffusion model that generates 4D novel views conditioned on one or more images of a scene together with a set of camera poses and timestamps. Because 4D training data is scarce, the researchers propose joint training on 3D data (with camera poses), 4D data (pose + time), and video data (time but no pose), and design a new architecture that makes this possible. They also calibrate SfM-posed data using monocular metric depth estimators, which gives the model metric-scale camera control.
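The calibration idea can be illustrated with a small sketch. SfM reconstructions are only defined up to an unknown global scale, so one plausible way to bring them to metric units is to compare SfM depths against the output of a metric depth estimator and rescale the camera translations accordingly. The source does not specify the exact procedure; the median-of-ratios estimator and the function names `estimate_metric_scale` and `rescale_poses` below are assumptions made for illustration only.

```python
import numpy as np


def estimate_metric_scale(sfm_depth, metric_depth):
    """Return s such that s * sfm_depth approximates metric_depth.

    Assumption: a single global scale, estimated as the median of
    per-pixel depth ratios over pixels where both depths are valid.
    """
    sfm_depth = np.asarray(sfm_depth, dtype=np.float64)
    metric_depth = np.asarray(metric_depth, dtype=np.float64)
    valid = (
        np.isfinite(sfm_depth) & np.isfinite(metric_depth)
        & (sfm_depth > 0) & (metric_depth > 0)
    )
    if not valid.any():
        raise ValueError("no valid depth pairs to estimate scale from")
    return float(np.median(metric_depth[valid] / sfm_depth[valid]))


def rescale_poses(cam_to_world, scale):
    """Scale the translation part of 4x4 camera-to-world matrices.

    Rotations are left untouched; only camera positions change units.
    """
    poses = np.array(cam_to_world, dtype=np.float64, copy=True)
    poses[..., :3, 3] *= scale
    return poses


# Toy example: SfM depths are 2.5x too small; zero marks a missing SfM depth.
sfm = np.array([[1.0, 2.0], [0.0, 4.0]])
metric = np.array([[2.5, 5.0], [3.0, 10.0]])
scale = estimate_metric_scale(sfm, metric)

pose = np.eye(4)
pose[:3, 3] = [1.0, 2.0, 3.0]
metric_pose = rescale_poses(pose[None], scale)[0]
```

The median is used here because it tolerates a fraction of bad depth predictions better than a mean would; a real pipeline might aggregate over many frames or use a more robust fit.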
The authors introduce new evaluation metrics to address shortcomings of existing evaluation schemes, and report state-of-the-art results for 4DiM in both fidelity and pose control compared to existing diffusion models for 3D NVS, while adding the ability to handle temporal dynamics. Beyond sample generation, 4DiM is also applicable to pose-conditioned video-to-video translation, panorama stitching, and other tasks.
The main contributions include:
1. Extending from objects to scenes;
2. Allowing free-form camera pose control;
3. Achieving simultaneous spatial and temporal control by conditioning on camera poses and timestamps.
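To make the conditioning in item 3 concrete, and to show how it could coexist with training data that lacks either pose (video) or time (static 3D scenes), here is a minimal sketch of per-frame conditioning with availability masks. The source does not describe how missing signals are represented; the zero-filled placeholders, the boolean masks, and the function name `build_conditioning` are all assumptions for illustration, not the authors' architecture.

```python
import numpy as np


def build_conditioning(num_frames, poses=None, timestamps=None):
    """Pack per-frame pose and time signals with masks for what is present.

    Assumption: a missing signal is zero-filled and flagged by a mask,
    so a model could learn to ignore it for that sample.
    """
    pose = np.zeros((num_frames, 4, 4))
    time = np.zeros(num_frames)
    has_pose = poses is not None
    has_time = timestamps is not None
    if has_pose:
        pose[:] = np.asarray(poses, dtype=np.float64)
    if has_time:
        time[:] = np.asarray(timestamps, dtype=np.float64)
    return {
        "pose": pose,
        "time": time,
        "pose_mask": np.full(num_frames, has_pose),
        "time_mask": np.full(num_frames, has_time),
    }


identity_poses = np.tile(np.eye(4), (3, 1, 1))

# 3D data: poses only. Video: timestamps only. 4D data: both.
cond_3d = build_conditioning(3, poses=identity_poses)
cond_video = build_conditioning(3, timestamps=[0.0, 0.5, 1.0])
cond_4d = build_conditioning(3, poses=identity_poses, timestamps=[0.0, 0.5, 1.0])
```

With this layout, all three data sources share one batch format, which is one simple way to support joint training across them.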
In the experiments, 4DiM outperforms existing diffusion models across several benchmarks, especially in 3D consistency, pose alignment, and the rendering of motion in dynamic content. The paper also discusses the importance of joint training with video data and the effect of calibrated data on model performance.