MVDiff: Scalable and Flexible Multi-View Diffusion for 3D Object Reconstruction from Single-View

Emmanuelle Bourigault, Pauline Bourigault
2024-06-13
Abstract: Generating consistent multiple views for 3D reconstruction tasks is still a challenge for existing image-to-3D diffusion models. Generally, incorporating 3D representations into a diffusion model decreases the model's speed as well as its generalizability and quality. This paper proposes a general framework to generate consistent multi-view images from a single image by leveraging a scene representation transformer and a view-conditioned diffusion model. In the model, we introduce epipolar geometry constraints and multi-view attention to enforce 3D consistency. From as few as one input image, our model is able to generate 3D meshes surpassing baseline methods in evaluation metrics, including PSNR, SSIM and LPIPS.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper aims to solve the problem of generating high-quality multi-view images from a single view for 3D reconstruction. Specifically, existing diffusion-model-based methods face challenges in generating consistent multi-view images, especially in terms of speed, generalization ability, and image quality. The paper proposes a new framework, MVDiff, which improves the consistency of multi-view generation and the quality of 3D reconstruction by introducing epipolar geometry constraints and a multi-view attention mechanism.

### Main Problems and Solutions

1. **Problems**:
   - **Consistency Problem**: Existing methods have difficulty maintaining consistency between views when generating multi-view images, resulting in poor 3D reconstruction quality.
   - **Efficiency Problem**: Existing methods are slow when processing large-scale datasets and have limited generalization ability.
   - **Detail Problem**: The generated 3D models often lack details, especially when dealing with complex shapes.
2. **Solutions**:
   - **Epipolar Geometry Constraints**: By introducing epipolar geometry constraints, the model can learn the geometric correspondence between views during the training process, thereby improving the consistency of the generated images.
   - **Multi-view Attention Mechanism**: Using a multi-view self-attention mechanism ensures that features between different views can be effectively aggregated, further enhancing view consistency.
   - **Scene Representation Transformer (SRT)**: Through SRT, an implicit 3D representation is learned, and given the input view and camera parameters, high-quality multi-view images can be generated.
   - **Conditional Diffusion Model**: Combining the conditional diffusion model generates multi-view-consistent images, thereby improving the quality of 3D reconstruction.
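
The summary does not say how MVDiff formulates its epipolar constraint, so the following is only a generic illustration of the underlying geometry, not the paper's implementation. It assumes a pinhole camera with intrinsics shared by both views (focal length `f`, principal point `cx`, `cy`) and a known relative pose `X2 = R @ X1 + t`. All function names (`fundamental_matrix`, `epipolar_line`, `epipolar_mask`, and the helpers) are hypothetical. The sketch builds the fundamental matrix, maps a pixel in view 1 to its epipolar line in view 2, and derives a boolean mask of view-2 pixels near that line, which is one plausible way such a constraint could restrict cross-view feature matching.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def skew(t):
    """Cross-product matrix [t]_x, so that [t]_x v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def fundamental_matrix(f, cx, cy, R, t):
    """F with x2^T F x1 = 0 for two views related by X2 = R X1 + t.

    Assumption: both views share K = [[f, 0, cx], [0, f, cy], [0, 0, 1]].
    """
    K_inv = [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]
    E = matmul(skew(t), R)  # essential matrix
    return matmul(matmul(transpose(K_inv), E), K_inv)

def epipolar_line(F, pixel):
    """Line (a, b, c) in view 2, with a*u + b*v + c = 0, for pixel (u, v) in view 1."""
    u, v = pixel
    col = matmul(F, [[u], [v], [1.0]])
    return col[0][0], col[1][0], col[2][0]

def distance_to_line(line, pixel):
    """Perpendicular pixel distance; assumes the line is not at infinity (a, b not both 0)."""
    a, b, c = line
    u, v = pixel
    return abs(a * u + b * v + c) / math.hypot(a, b)

def epipolar_mask(F, pixel, width, height, threshold=1.0):
    """Grid over view 2: True where a pixel lies within `threshold` px of the epipolar line."""
    line = epipolar_line(F, pixel)
    return [[distance_to_line(line, (u, v)) <= threshold for u in range(width)]
            for v in range(height)]

def project(f, cx, cy, X):
    """Pinhole projection of a 3D point in camera coordinates."""
    x, y, z = X
    return f * x / z + cx, f * y / z + cy

# Toy setup: view 2 is rotated 30 degrees about the y-axis and translated.
angle = math.radians(30.0)
R = [[math.cos(angle), 0.0, math.sin(angle)],
     [0.0, 1.0, 0.0],
     [-math.sin(angle), 0.0, math.cos(angle)]]
t = [-1.0, 0.0, 0.1]
f, cx, cy = 100.0, 32.0, 32.0

X1 = (0.2, -0.1, 2.0)  # a 3D point in view-1 camera coordinates
X2 = tuple(sum(R[i][k] * X1[k] for k in range(3)) + t[i] for i in range(3))
x1 = project(f, cx, cy, X1)
x2 = project(f, cx, cy, X2)

F = fundamental_matrix(f, cx, cy, R, t)
# The true correspondence x2 should sit on the epipolar line of x1 (distance near zero).
print(distance_to_line(epipolar_line(F, x1), x2))
```

The point of the mask is that a pixel in one view can only correspond to pixels along a single line in the other view, so a model that respects this geometry searches a 1D set of candidates rather than the whole image.
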
### Experimental Results

- **Novel View Synthesis**: On the GSO and NeRF synthesis datasets, MVDiff outperforms the baseline methods in evaluation metrics such as PSNR, SSIM, and LPIPS. In particular, as the number of reference views increases, the performance improves significantly.
- **3D Generation**: On the GSO dataset, the 3D models generated by MVDiff are more visually consistent and have more details, especially when dealing with complex shapes.

### Conclusion

By introducing epipolar geometry constraints and a multi-view attention mechanism, MVDiff successfully solves the problem of generating high-quality multi-view images from a single view for 3D reconstruction. This framework not only improves the consistency of the generated images and the quality of 3D reconstruction but also has good generalization ability and high efficiency. Future work can further explore combining lighting and texture knowledge to generate more diverse 3D shapes.
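
Of the three metrics named in the experimental results, PSNR has a simple closed form, PSNR = 10 · log10(MAX² / MSE), where higher values mean the rendered view is closer to the reference. The sketch below is a minimal standard-library version for grayscale images stored as nested lists with values in [0, `max_value`]. The function names are hypothetical and this is not the paper's evaluation code. SSIM and LPIPS are not shown: SSIM needs windowed local statistics, and LPIPS relies on a pretrained network.

```python
import math

def mean_squared_error(reference, rendered):
    """Mean squared error between two equally sized 2D images (nested lists)."""
    ref = [p for row in reference for p in row]
    out = [p for row in rendered for p in row]
    if len(ref) != len(out) or not ref:
        raise ValueError("images must be non-empty and have the same size")
    return sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB; returns infinity for identical images."""
    err = mean_squared_error(reference, rendered)
    if err == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)

# Toy 2x2 example: every pixel of the rendered view is off by 0.1.
reference = [[0.0, 1.0], [0.5, 0.5]]
rendered = [[0.1, 0.9], [0.6, 0.4]]
print(psnr(reference, rendered))
```
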