VistaDream: Sampling multiview consistent images for single-view scene reconstruction

Haiping Wang, Yuan Liu, Ziwei Liu, Wenping Wang, Zhen Dong, Bisheng Yang
2024-10-22
Abstract: In this paper, we propose VistaDream, a novel framework to reconstruct a 3D scene from a single-view image. Recent diffusion models enable generating high-quality novel-view images from a single-view input image. Most existing methods concentrate only on building consistency between the input image and the generated images, while losing consistency among the generated images themselves. VistaDream addresses this problem with a two-stage pipeline. In the first stage, VistaDream builds a global coarse 3D scaffold by zooming out a small step, with inpainted boundaries and an estimated depth map. Then, on this global scaffold, we use iterative diffusion-based RGB-D inpainting to generate novel-view images that fill the holes of the scaffold. In the second stage, we further enhance the consistency between the generated novel-view images with a novel training-free Multiview Consistency Sampling (MCS) that introduces multi-view consistency constraints into the reverse sampling process of diffusion models. Experimental results demonstrate that, without training or fine-tuning existing diffusion models, VistaDream achieves consistent and high-quality novel view synthesis using just single-view images and outperforms baseline methods by a large margin. The code, videos, and interactive demos are available at https://vistadream-project-page.github.io/.
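The first stage described in the abstract (zoom out, fill the new boundary, estimate depth, lift to 3D) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a pinhole camera with a known focal length, and it replaces the inpainting model and the depth estimator with trivial stand-ins (mean color and median depth). The names `build_scaffold` and `horizontal_fov` are hypothetical.

```python
import numpy as np

def horizontal_fov(width, focal):
    """Horizontal field of view (radians) of a pinhole camera."""
    return 2.0 * np.arctan(width / (2.0 * focal))

def build_scaffold(image, depth, focal, pad):
    """Zoom out by padding an RGB-D frame, then unproject it to a colored point cloud.

    Keeping the focal length fixed while enlarging the canvas widens the FoV.
    Returns (points, colors, mask); mask is True where content was synthesized.
    """
    h, w = depth.shape
    rgb = np.pad(image.astype(float), ((pad, pad), (pad, pad), (0, 0)))
    dep = np.pad(depth.astype(float), pad)

    # Border pixels are the region a real pipeline would inpaint.
    mask = np.ones(dep.shape, dtype=bool)
    mask[pad:pad + h, pad:pad + w] = False

    # ASSUMPTION: stand-ins for the diffusion inpainter and the depth estimator.
    rgb[mask] = image.reshape(-1, 3).mean(axis=0)
    dep[mask] = np.median(depth)

    # Pinhole unprojection with the principal point at the image center.
    big_h, big_w = dep.shape
    u, v = np.meshgrid(np.arange(big_w), np.arange(big_h))
    cx, cy = (big_w - 1) / 2.0, (big_h - 1) / 2.0
    points = np.stack([(u - cx) / focal * dep, (v - cy) / focal * dep, dep], axis=-1)
    return points.reshape(-1, 3), rgb.reshape(-1, 3), mask
```

In a real pipeline the masked border would be filled by an inpainting model and the depth would come from a monocular depth estimator; the resulting points would then seed the coarse scaffold.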
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the reconstruction of 3D scenes from a single-view image. Existing methods keep each generated novel-view image consistent with the input image, but struggle to keep the generated images consistent with one another. As a result, the reconstructed 3D scene looks inconsistent across viewpoints.

### Main Contributions

1. **Two-Stage Framework**:
   - **Stage 1**: Build a coarse global 3D scaffold by zooming out slightly to expand the field of view (FoV), inpainting the new boundary regions, and estimating a depth map. Iterative diffusion-based RGB-D inpainting then generates novel-view images that fill the holes in the scaffold.
   - **Stage 2**: Apply the Multiview Consistency Sampling (MCS) algorithm to resample multiview-consistent images from a pre-trained diffusion model and use them to refine the reconstructed 3D scene.
2. **Multiview Consistency Sampling (MCS)**:
   - MCS is training-free. It introduces multiview consistency constraints into the reverse sampling process of the diffusion model, which keeps the generated multiview images consistent with one another and improves the quality of the 3D scene.

### Experimental Results

- **Qualitative Comparison**: Compared to existing methods (such as RealmDreamer, GenWarp, and CAT3D), VistaDream produces 3D scenes whose multiview images are more consistent and of higher quality.
- **Quantitative Evaluation**: On multiple metrics (noise level, edge sharpness, structure, detail, and overall quality), VistaDream surpasses the other baseline methods without any fine-tuning and is comparable to the extensively trained CAT3D method.

### Conclusion

VistaDream proposes a two-stage framework for reconstructing high-quality 3D scenes from single-view images, combining a vision-language model-assisted global 3D scaffold with multiview consistency sampling. Experimental results indicate that the method outperforms existing baseline methods both qualitatively and quantitatively, without the need for fine-tuning.
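To make the idea of constraining the reverse sampling process concrete, here is a toy sketch, not the paper's algorithm. It assumes a deterministic DDIM-style update and, at every step, pulls each view's clean-image estimate toward a shared reference before continuing. The reference here is simply the mean over views, standing in for whatever shared 3D representation a real system would render from. The names `sample_views`, `toy_denoiser`, `shared_mean`, and the `weight` parameter are hypothetical.

```python
import numpy as np

def ddim_step(x_t, x0_hat, a_t, a_prev):
    """Deterministic DDIM update driven by a clean-image estimate x0_hat."""
    eps_hat = (x_t - np.sqrt(a_t) * x0_hat) / np.sqrt(1.0 - a_t)
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat

def sample_views(x_T, denoise_fn, consistency_fn, alpha_bars, weight):
    """Reverse-sample all views jointly, rectifying x0 estimates at every step.

    alpha_bars runs from noisy (small) to clean (1.0); weight in [0, 1]
    controls how strongly each view is pulled toward the shared reference.
    """
    x = x_T
    for a_t, a_prev in zip(alpha_bars[:-1], alpha_bars[1:]):
        x0_hat = denoise_fn(x, a_t)          # per-view prediction
        x0_ref = consistency_fn(x0_hat)      # cross-view consistent reference
        x0_hat = (1.0 - weight) * x0_hat + weight * x0_ref
        x = ddim_step(x, x0_hat, a_t, a_prev)
    return x

# ASSUMPTION: a toy "denoiser" whose three views disagree about the scene.
targets = np.stack([np.full((8, 8), v) for v in (0.2, 0.5, 0.8)])

def toy_denoiser(x_t, a_t):
    return targets

def shared_mean(x0):
    # Stand-in for rendering every view from one shared 3D scene.
    return np.broadcast_to(x0.mean(axis=0), x0.shape)

alpha_bars = np.linspace(0.02, 1.0, 21)
x_T = np.random.default_rng(0).standard_normal(targets.shape)
independent = sample_views(x_T, toy_denoiser, shared_mean, alpha_bars, weight=0.0)
consistent = sample_views(x_T, toy_denoiser, shared_mean, alpha_bars, weight=0.7)
```

With `weight=0.0` each view converges to its own inconsistent target, while a positive weight shrinks the disagreement between views. No model weights are updated, which is the sense in which such a scheme is training-free.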