IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation

Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, Filippos Kokkinos

2024-02-14

Abstract: Most text-to-3D generators build upon off-the-shelf text-to-image models trained on billions of images. They use variants of Score Distillation Sampling (SDS), which is slow, somewhat unstable, and prone to artifacts. A mitigation is to fine-tune the 2D generator to be multi-view aware, which can help distillation or can be combined with reconstruction networks to output 3D objects directly. In this paper, we further explore the design space of text-to-3D models. We significantly improve multi-view generation by considering video instead of image generators. Combined with a 3D reconstruction algorithm which, by using Gaussian splatting, can optimize a robust image-based loss, we directly produce high-quality 3D outputs from the generated views. Our new method, IM-3D, reduces the number of evaluations of the 2D generator network 10-100x, resulting in a much more efficient pipeline, better quality, fewer geometric inconsistencies, and higher yield of usable 3D assets.

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses three weaknesses of current text-to-3D generation models: inefficiency, geometric inconsistency, and poor output quality. Specifically:

1. **Efficiency**: Existing methods based on Score Distillation Sampling (SDS) require thousands of evaluations of the 2D generator, so producing a single 3D asset can take several hours.
2. **Geometric inconsistency**: The 2D generator has no inherent 3D awareness, so using SDS to gradually match views of a 3D object to the 2D model tends to introduce geometric inconsistencies.
3. **Generation quality**: These methods are prone to artifacts and may fail to converge, which yields low-quality 3D assets.

To address these problems, the paper proposes IM-3D, which improves on prior work in the following ways:

- **Multi-view video generation**: Replace the usual text-to-image generator with a text-to-video generator (such as Emu Video) to produce a higher-quality and more consistent multi-view image sequence.
- **Fast and robust 3D reconstruction**: Use Gaussian splatting to reconstruct the 3D model directly from the generated multi-view images, avoiding the complex distillation process.
- **Iterative optimization**: Feed the 3D reconstruction back to the 2D generator iteratively to further improve quality and consistency.

With these changes, IM-3D reduces the number of evaluations of the 2D generator by 10-100x, improves generation speed and quality, and reduces geometric inconsistencies.
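The generate, reconstruct, and feed-back loop described in this summary can be sketched as a toy program. This is only an illustration of the control flow under stated assumptions, not the paper's implementation: `CountingGenerator`, `reconstruct`, `render`, and `iterative_generation` are hypothetical names, each "view" is a single number rather than an image, the "3D model" is a single shared value, and the round count, step count, and noise schedule are made-up parameters.

```python
import random


class CountingGenerator:
    """Hypothetical stand-in for a multi-view (video) diffusion sampler."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.evaluations = 0  # number of simulated denoiser network calls

    def sample(self, views, steps, noise_level):
        # Re-noise the incoming views, then denoise them for `steps` steps.
        views = [v + self.rng.gauss(0.0, noise_level) for v in views]
        for _ in range(steps):
            self.evaluations += 1  # one network evaluation per sampling step
            # Toy denoiser: pull every view halfway toward the value 1.0.
            views = [0.5 * v + 0.5 * 1.0 for v in views]
        return views


def reconstruct(views):
    """Toy 3D fit: the single value that best explains all views (their mean)."""
    return sum(views) / len(views)


def render(model, num_views):
    """Toy renderer: all views of one 3D model agree by construction."""
    return [model] * num_views


def iterative_generation(num_views=4, rounds=2, steps=10):
    """Alternate sampling and reconstruction; all parameters are assumptions."""
    gen = CountingGenerator()
    views = [0.0] * num_views
    noise = 1.0
    model = None
    for _ in range(rounds):
        views = gen.sample(views, steps, noise)  # 2D generator produces views
        model = reconstruct(views)               # fit one 3D model to them
        views = render(model, num_views)         # consistent views fed back
        noise *= 0.5                             # assumed: less noise each round
    return model, views, gen.evaluations
```

In this sketch the number of generator evaluations is `rounds * steps`, since each round runs one full sampling pass instead of querying the generator once per optimization step as SDS does. The rendered views are identical after each round because they all come from the same reconstructed model.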

IT3D: Improved Text-to-3D Generation with Explicit View Synthesis

ET3D: Efficient Text-to-3D Generation via Multi-View Distillation

ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models

V3D: Video Diffusion Models are Effective 3D Generators

SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image using Latent Video Diffusion

3DDesigner: Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion Models

Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior

3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation

Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion

MVDiffusion++: A Dense High-resolution Multi-view Diffusion Model for Single or Sparse-view 3D Object Reconstruction

BoostDream: Efficient Refining for High-Quality Text-to-3D Generation from Multi-View Diffusion

Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior

MVGaussian: High-Fidelity text-to-3D Content Generation with Multi-View Guidance and Surface Densification

Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation

Sharp-It: A Multi-view to Multi-view Diffusion Model for 3D Synthesis and Manipulation

Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion

HiFi-123: Towards High-fidelity One Image to 3D Content Generation