Consistent4D: Consistent 360° Dynamic Object Generation from Monocular Video

Yanqin Jiang, Li Zhang, Jin Gao, Weimin Hu, Yao Yao
2023-11-06
Abstract: In this paper, we present Consistent4D, a novel approach for generating 4D dynamic objects from uncalibrated monocular videos. Uniquely, we cast the 360-degree dynamic object reconstruction as a 4D generation problem, eliminating the need for tedious multi-view data collection and camera calibration. This is achieved by leveraging the object-level 3D-aware image diffusion model as the primary supervision signal for training Dynamic Neural Radiance Fields (DyNeRF). Specifically, we propose a Cascade DyNeRF to facilitate stable convergence and temporal continuity under the supervision signal which is discrete along the time axis. To achieve spatial and temporal consistency, we further introduce an Interpolation-driven Consistency Loss. It is optimized by minimizing the discrepancy between rendered frames from DyNeRF and interpolated frames from a pre-trained video interpolation model. Extensive experiments show that our Consistent4D can perform competitively to prior art alternatives, opening up new possibilities for 4D dynamic object generation from monocular videos, whilst also demonstrating advantage for conventional text-to-3D generation tasks. Our project page is https://consistent4d.github.io/.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of generating consistent 360-degree dynamic objects (4D objects) from monocular videos. The authors propose Consistent4D, a method that generates 4D dynamic objects from uncalibrated monocular videos without cumbersome multi-view data acquisition or camera calibration.

### Main Challenges

1. **Spatial and Temporal Consistency**: The generated 4D dynamic objects must remain consistent both across viewpoints and over time.
2. **Lack of Multi-view Information**: A monocular video provides no effective multi-view information, so traditional multi-view reconstruction methods are difficult to apply.
3. **High-Quality Rendering**: The generated 4D dynamic objects need clear details and natural motion.

### Solutions

1. **Cascade Dynamic Neural Radiance Field (Cascade DyNeRF)**: A cascade structure for DyNeRF, trained with a pre-trained 3D-aware image diffusion model as the primary supervision signal. The cascade design promotes stable convergence and temporal continuity even though the supervision is discrete along the time axis.
2. **Interpolation-driven Consistency Loss (ICL)**: A loss that improves spatial and temporal consistency by minimizing the difference between frames rendered from DyNeRF and frames interpolated by a pre-trained video interpolation model.
3. **Lightweight Video Enhancer**: A lightweight video enhancer trained to further improve the quality of videos rendered from DyNeRF, reducing artifacts such as blurry edges and floaters.

### Experimental Results

- **Quantitative Evaluation**: Experiments on multiple synthetic and real-world videos show that Consistent4D outperforms the compared methods in terms of LPIPS and CLIP scores.
- **Qualitative Evaluation**: The generated 4D dynamic objects look better in novel views than those produced by D-NeRF and K-Planes, especially in the absence of multi-view information.

### Contributions

1. A new framework for generating 4D dynamic objects from statically captured monocular videos, with Cascade DyNeRF designed to represent the dynamic object and optimized with a Score Distillation Sampling (SDS) loss.
2. The Interpolation-driven Consistency Loss (ICL), which significantly improves spatial and temporal consistency in 4D generation tasks.
3. A lightweight video enhancer that further improves the quality of generated videos, demonstrating the potential of video-to-4D generation.

### Conclusion

The proposed method makes significant progress in generating 4D dynamic objects from monocular videos. It addresses spatial and temporal consistency while improving the quality of the generated objects, and it opens new possibilities for future research in dynamic scene generation.
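
To make the Interpolation-driven Consistency Loss more concrete, the following is a minimal sketch of the idea as described in the abstract, not the authors' implementation. It renders two end frames and a middle frame, asks an interpolation model for the expected middle frame, and penalizes the discrepancy. All names here (`blend_interpolator`, `interpolation_consistency_loss`, the toy renderers) are hypothetical. The pixel-averaging interpolator is only a stand-in for a pre-trained video interpolation model, and the mean squared error is an assumed choice of discrepancy measure. The scalar `s` can be read as a timestamp (temporal consistency) or a camera angle (spatial consistency).

```python
from typing import Callable, List

Frame = List[List[float]]  # a tiny grayscale image stored as rows of pixel values


def blend_interpolator(a: Frame, b: Frame) -> Frame:
    """Hypothetical stand-in for a pre-trained video interpolation model.

    It averages the two end frames pixel by pixel; a real model would
    estimate motion between them instead.
    """
    return [[(pa + pb) / 2.0 for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def mean_squared_error(x: Frame, y: Frame) -> float:
    """Assumed discrepancy measure: mean of squared per-pixel differences."""
    diffs = [(px - py) ** 2 for rx, ry in zip(x, y) for px, py in zip(rx, ry)]
    return sum(diffs) / len(diffs)


def interpolation_consistency_loss(
    render: Callable[[float], Frame],
    interpolate: Callable[[Frame, Frame], Frame],
    start: float,
    end: float,
) -> float:
    """Compare the rendered middle frame with the interpolated middle frame."""
    first = render(start)
    last = render(end)
    middle = render((start + end) / 2.0)
    target = interpolate(first, last)  # treated as a fixed pseudo ground truth
    return mean_squared_error(middle, target)


def smooth_renderer(s: float) -> Frame:
    """Toy renderer whose pixels change linearly with s (time or view angle)."""
    return [[s, 2.0 * s], [0.5, 1.0 - s]]


def jumpy_renderer(s: float) -> Frame:
    """Toy renderer with an abrupt change at s = 0.4, i.e. an inconsistency."""
    value = 0.0 if s < 0.4 else 1.0
    return [[value, value], [value, value]]


smooth_loss = interpolation_consistency_loss(smooth_renderer, blend_interpolator, 0.0, 1.0)
jumpy_loss = interpolation_consistency_loss(jumpy_renderer, blend_interpolator, 0.0, 1.0)
print(smooth_loss, jumpy_loss)
```

In this toy setting the smoothly varying renderer incurs no penalty, while the renderer with an abrupt jump does. That is the behavior such a loss is meant to encourage: rendered intermediate frames should agree with what an interpolation model predicts from their neighbors.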