Abstract: In this paper, we propose VistaDream, a novel framework to reconstruct a 3D scene from a single-view image. Recent diffusion models enable generating high-quality novel-view images from a single-view input image. Most existing methods concentrate only on the consistency between the input image and the generated images, and lose the consistency among the generated images themselves. VistaDream addresses this problem with a two-stage pipeline. In the first stage, VistaDream builds a global coarse 3D scaffold by zooming out a small step with inpainted boundaries and an estimated depth map. Then, on this global scaffold, we use iterative diffusion-based RGB-D inpainting to generate novel-view images that fill the holes of the scaffold. In the second stage, we further enhance the consistency between the generated novel-view images with a novel training-free Multiview Consistency Sampling (MCS) that introduces multi-view consistency constraints into the reverse sampling process of diffusion models. Experimental results demonstrate that, without training or fine-tuning existing diffusion models, VistaDream achieves consistent and high-quality novel view synthesis using just single-view images and outperforms baseline methods by a large margin. The code, videos, and interactive demos are available at https://vistadream-project-page.github.io/.
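As a rough illustration of the first-stage idea of turning an image plus an estimated depth map into a 3D scaffold, the sketch below lifts each pixel into camera space with a standard pinhole model. The function name `unproject_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions made for this example; the abstract does not specify the camera model or data layout that VistaDream uses.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def unproject_depth(
    depth: List[List[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Lift every pixel (u, v) with depth z to a camera-space 3D point.

    Assumes a pinhole camera (hypothetical intrinsics fx, fy, cx, cy);
    pixels with non-positive depth are treated as holes and skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # a hole that later inpainting would need to fill
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x2 depth map with one missing value (0.0 marks a hole).
scaffold = unproject_depth([[2.0, 2.0], [0.0, 4.0]], fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(len(scaffold), "points in the coarse scaffold")
```

Pixels without valid depth are left out rather than guessed, which mirrors the role of the holes that the abstract says are filled later by RGB-D inpainting.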

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses the problem of reconstructing 3D scenes from single-view images. Existing methods keep each generated novel-view image consistent with the input image, but they struggle to keep the generated images consistent with one another. As a result, the reconstructed 3D scene looks inconsistent across viewpoints.

### Main Contributions

1. **Two-Stage Framework**:
   - **Stage 1**: Construct a coarse global 3D scaffold by zooming out slightly to expand the field of view (FoV), inpainting the newly exposed boundaries, and estimating a depth map; then fill the remaining holes of the scaffold with iterative diffusion-based RGB-D inpainting.
   - **Stage 2**: Apply the Multiview Consistency Sampling (MCS) algorithm to resample multiview-consistent images from a pre-trained diffusion model and use them to refine the reconstructed 3D scene.
2. **Multiview Consistency Sampling (MCS)**:
   - MCS introduces multiview consistency constraints into the reverse sampling process of the diffusion model, which keeps the generated multiview images consistent with one another and thereby improves the quality of the 3D scene.

### Experimental Results

- **Qualitative Comparison**: Compared to existing methods (such as RealmDreamer, GenWarp, and CAT3D), VistaDream shows clear advantages in the consistency and quality of multiview images rendered from the generated 3D scenes.
- **Quantitative Evaluation**: Results on multiple metrics (noise level, edge sharpness, structure, detail, and overall quality) show that VistaDream surpasses the other baseline methods without any fine-tuning and is comparable to the extensively trained CAT3D method.

### Conclusion

VistaDream proposes a two-stage framework that addresses the problem of reconstructing high-quality 3D scenes from single-view images by introducing a vision-language model-assisted global 3D scaffold and multiview consistency sampling. Experimental results indicate that the method outperforms existing baseline methods both qualitatively and quantitatively without the need for fine-tuning.
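To make the idea of a consistency constraint inside reverse sampling more concrete, here is a minimal toy sketch. It is not the paper's MCS algorithm: the per-pixel averaging used as the "consensus", the `predict_x0` callback, the `weight` parameter, and the function name are all assumptions for illustration. The only standard ingredient is the deterministic DDIM-style update that re-noises a clean-image estimate to the next noise level.

```python
import math
from typing import Callable, List

View = List[float]  # a "view" is a flat list of pixel values in this toy


def consistent_reverse_sampling(
    noisy_views: List[View],
    predict_x0: Callable[[int, View, float], View],  # hypothetical denoiser
    alphas: List[float],  # cumulative alpha-bar per step; last entry must be 1.0
    weight: float,        # 0.0 = independent views, 1.0 = fully tied to consensus
) -> List[View]:
    views = [list(v) for v in noisy_views]
    for step in range(len(alphas) - 1):
        a_t, a_prev = alphas[step], alphas[step + 1]
        # 1. Each view predicts its own clean image.
        x0s = [predict_x0(k, v, a_t) for k, v in enumerate(views)]
        # 2. Toy consensus: per-pixel mean across views (stand-in for a
        #    shared scene representation; assumes views are pixel-aligned).
        consensus = [sum(px) / len(x0s) for px in zip(*x0s)]
        # 3. Pull every prediction toward the consensus.
        x0s = [
            [(1 - weight) * p + weight * c for p, c in zip(x0, consensus)]
            for x0 in x0s
        ]
        # 4. Deterministic DDIM-style step using the rectified prediction.
        new_views = []
        for v, x0 in zip(views, x0s):
            eps = [
                (xi - math.sqrt(a_t) * pi) / math.sqrt(1 - a_t)
                for xi, pi in zip(v, x0)
            ]
            new_views.append(
                [
                    math.sqrt(a_prev) * pi + math.sqrt(1 - a_prev) * ei
                    for pi, ei in zip(x0, eps)
                ]
            )
        views = new_views
    return views


# Two views whose (toy) denoisers disagree about the clean content.
targets = [[0.0, 1.0], [1.0, 3.0]]
result = consistent_reverse_sampling(
    noisy_views=[[0.3, -0.2], [1.5, 0.7]],
    predict_x0=lambda k, x, a: targets[k],
    alphas=[0.1, 0.5, 0.9, 1.0],
    weight=1.0,
)
print(result)
```

With `weight=0.0` each view converges to its own denoiser's belief, which is the inconsistency the summary describes; raising `weight` forces the views to agree. No model is trained or fine-tuned in this loop, in line with the training-free claim above.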

VistaDream: Sampling multiview consistent images for single-view scene reconstruction
