Abstract: Generating consistent multiple views for 3D reconstruction remains a challenge for existing image-to-3D diffusion models. Generally, incorporating 3D representations into a diffusion model decreases the model's speed as well as its generalizability and quality. This paper proposes a general framework that generates consistent multi-view images from a single image by leveraging a scene representation transformer and a view-conditioned diffusion model. In the model, we introduce epipolar geometry constraints and multi-view attention to enforce 3D consistency. From as few as one input image, our model is able to generate 3D meshes that surpass baseline methods on evaluation metrics including PSNR, SSIM, and LPIPS.

What problem does this paper attempt to address?

This paper aims to solve the problem of generating high-quality multi-view images from a single view for 3D reconstruction. Specifically, existing diffusion-model-based methods struggle to generate consistent multi-view images without sacrificing speed, generalization ability, or image quality. The paper proposes a new framework, MVDiff, which improves the consistency of multi-view generation and the quality of 3D reconstruction by introducing epipolar geometry constraints and a multi-view attention mechanism.

### Main Problems and Solutions

1. **Problems**:
   - **Consistency Problem**: Existing methods have difficulty maintaining consistency between views when generating multi-view images, resulting in poor 3D reconstruction quality.
   - **Efficiency Problem**: Existing methods are slow when processing large-scale datasets and have limited generalization ability.
   - **Detail Problem**: The generated 3D models often lack detail, especially for complex shapes.
2. **Solutions**:
   - **Epipolar Geometry Constraints**: With epipolar geometry constraints, the model learns the geometric correspondence between views during training, which improves the consistency of the generated images.
   - **Multi-view Attention Mechanism**: A multi-view self-attention mechanism aggregates features across the different views, further enhancing view consistency.
   - **Scene Representation Transformer (SRT)**: SRT learns an implicit 3D representation; given the input view and camera parameters, it supports generating high-quality multi-view images.
   - **Conditional Diffusion Model**: A conditional diffusion model generates multi-view-consistent images, thereby improving the quality of 3D reconstruction.
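
The summary does not say how MVDiff applies its epipolar constraint inside the network, so the following is only a generic illustration of the underlying geometry, not the authors' implementation. It assumes two pinhole cameras with shared intrinsics `K` and a relative pose `x2 = R @ x1 + t`; the function names (`skew`, `fundamental_matrix`, `epipolar_distance`, `epipolar_mask`) and the pixel `threshold` are hypothetical. The resulting boolean mask is the kind of signal that could restrict cross-view attention to pixels near the corresponding epipolar line.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])


def fundamental_matrix(K, R, t):
    """Fundamental matrix for a camera pair with shared intrinsics K.

    Assumes camera-2 coordinates are x2 = R @ x1 + t (an assumption of this sketch).
    """
    K_inv = np.linalg.inv(K)
    essential = skew(t) @ R
    return K_inv.T @ essential @ K_inv


def epipolar_distance(F, p1, p2):
    """Distance in pixels from each point in p2 (M, 2) to the epipolar line
    induced in view 2 by each point in p1 (N, 2). Returns an (N, M) array."""
    h1 = np.hstack([p1, np.ones((len(p1), 1))])  # homogeneous pixels, view 1
    h2 = np.hstack([p2, np.ones((len(p2), 1))])  # homogeneous pixels, view 2
    lines = h1 @ F.T                             # row i is the line F @ x1_i
    numerator = np.abs(lines @ h2.T)
    denominator = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    return numerator / np.maximum(denominator, 1e-12)


def epipolar_mask(F, p1, p2, threshold=2.0):
    """True where a view-2 pixel lies within `threshold` pixels of the epipolar line."""
    return epipolar_distance(F, p1, p2) <= threshold
```

For a purely horizontal camera translation with no rotation, the epipolar lines are horizontal, so a view-2 pixel on the same image row as the view-1 pixel has distance zero and passes the mask, while a pixel several rows away does not.
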

### Experimental Results

- **Novel View Synthesis**: On the GSO and NeRF Synthetic datasets, MVDiff outperforms the baseline methods on evaluation metrics such as PSNR, SSIM, and LPIPS. In particular, performance improves significantly as the number of reference views increases.
- **3D Generation**: On the GSO dataset, the 3D models generated by MVDiff are more visually consistent and more detailed, especially for complex shapes.

### Conclusion

By introducing epipolar geometry constraints and a multi-view attention mechanism, MVDiff addresses the problem of generating high-quality multi-view images from a single view for 3D reconstruction. The framework improves the consistency of the generated images and the quality of 3D reconstruction while retaining good generalization ability and high efficiency. Future work could explore incorporating lighting and texture knowledge to generate more diverse 3D shapes.
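
For readers unfamiliar with the metrics cited in the results, PSNR is the simplest of the three to state. The sketch below uses the standard definition, 10 · log10(MAX² / MSE), and assumes images stored as float arrays in [0, 1]. The function name `psnr` and the `max_value` parameter are illustrative and not taken from the paper; the evaluation protocol MVDiff actually used (resolution, masking, color space) is not described in this summary.

```python
import numpy as np


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    Higher is better; identical images give infinity.
    """
    reference = np.asarray(reference, dtype=np.float64)
    rendered = np.asarray(rendered, dtype=np.float64)
    if reference.shape != rendered.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((reference - rendered) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

A uniform per-pixel error of 0.1 on a [0, 1] image gives an MSE of 0.01 and therefore a PSNR of about 20 dB. SSIM and LPIPS are structural and learned perceptual metrics, respectively, and are normally computed with existing library implementations rather than by hand.
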

MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View
