3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation

Hansheng Chen, Bokui Shen, Yulin Liu, Ruoxi Shi, Linqi Zhou, Connor Z. Lin, Jiayuan Gu, Hao Su, Gordon Wetzstein, Leonidas Guibas
2024-10-25
Abstract: Multi-view image diffusion models have significantly advanced open-domain 3D object generation. However, most existing models rely on 2D network architectures that lack inherent 3D biases, resulting in compromised geometric consistency. To address this challenge, we introduce 3D-Adapter, a plug-in module designed to infuse 3D geometry awareness into pretrained image diffusion models. Central to our approach is the idea of 3D feedback augmentation: for each denoising step in the sampling loop, 3D-Adapter decodes intermediate multi-view features into a coherent 3D representation, then re-encodes the rendered RGBD views to augment the pretrained base model through feature addition. We study two variants of 3D-Adapter: a fast feed-forward version based on Gaussian splatting and a versatile training-free version utilizing neural fields and meshes. Our extensive experiments demonstrate that 3D-Adapter not only greatly enhances the geometry quality of text-to-multi-view models such as Instant3D and Zero123++, but also enables high-quality 3D generation using the plain text-to-image Stable Diffusion. Furthermore, we showcase the broad application potential of 3D-Adapter by presenting high-quality results in text-to-3D, image-to-3D, text-to-texture, and text-to-avatar tasks.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the lack of geometric consistency in the multi-view image diffusion models used for 3D generation. Most existing multi-view diffusion models rely on 2D network architectures that lack inherent 3D biases, so the 3D objects they produce have compromised geometry. These models achieve good global semantic consistency but struggle with local geometric consistency: 2D-3D feature alignment is imprecise and the geometry is often implausible, which leads to floating artifacts or blurry, low-detail 3D outputs. To address this, the authors propose 3D-Adapter, a plug-in module designed to inject 3D geometry awareness into pretrained image diffusion models. Its core idea, 3D feedback augmentation, works as follows: at each denoising step of the sampling loop, 3D-Adapter decodes the intermediate multi-view features into a coherent 3D representation, then re-encodes the rendered RGBD views to augment the pretrained base model through feature addition. The paper studies two variants: a fast feed-forward version based on Gaussian splatting and a training-free version that uses neural fields and meshes. According to the authors' experiments, 3D-Adapter greatly improves the geometry quality of text-to-multi-view models such as Instant3D and Zero123++, and also enables high-quality 3D generation with the plain text-to-image Stable Diffusion. The paper further demonstrates the approach on text-to-3D, image-to-3D, text-to-texture, and text-to-avatar tasks.
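To make the control flow of 3D feedback augmentation concrete, the following is a minimal toy sketch in NumPy. It is not the authors' implementation: every function here (`base_features`, `decode_views`, `fit_3d`, `render_rgbd`, `encode_rgbd`) is a hypothetical stand-in, the "3D representation" is just the average of all views, and the update rule is a simple interpolation rather than a real diffusion sampler. It only illustrates the order of operations per denoising step: decode intermediate views, build one shared representation, render RGBD, re-encode, and add the result to the base model's features.

```python
import numpy as np

def base_features(x):
    # Hypothetical stand-in for the pretrained base model's intermediate features.
    return 0.9 * x

def decode_views(feats):
    # Hypothetical stand-in for the output head: features -> per-view RGB in (-1, 1).
    return np.tanh(feats)

def fit_3d(views):
    # Toy "3D representation": a single image that all views are forced to share.
    # A real system would fit Gaussians, a neural field, or a mesh instead.
    return views.mean(axis=0)

def render_rgbd(rep, num_views):
    # Toy renderer: the shared RGB plus a constant depth channel, once per view.
    depth = np.ones(rep.shape[:2] + (1,))
    rgbd = np.concatenate([rep, depth], axis=-1)
    return np.repeat(rgbd[None], num_views, axis=0)

def encode_rgbd(rgbd, weight):
    # Toy adapter encoder: linear map from 4 RGBD channels to 3 feature channels.
    return rgbd @ weight

def sample(num_views=4, size=8, steps=10, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_views, size, size, 3))  # noisy multi-view images
    weight = 0.1 * rng.standard_normal((4, 3))           # assumed adapter weights
    for i in range(steps):
        feats = base_features(x)
        views = decode_views(feats)                # intermediate multi-view prediction
        rep = fit_3d(views)                        # one coherent shared representation
        rgbd = render_rgbd(rep, num_views)         # consistent RGBD renderings
        feats = feats + encode_rgbd(rgbd, weight)  # feedback through feature addition
        x0_hat = decode_views(feats)               # augmented prediction
        x = x + (x0_hat - x) / (steps - i)         # toy update toward the prediction
    return x

if __name__ == "__main__":
    out = sample()
    print(out.shape)
```

In this toy version the feedback term is identical for every view, so it pulls all views toward a common signal; in the paper's setting the renderings are view-dependent, and the two variants differ in how the 3D representation is obtained (feed-forward Gaussian splatting versus training-free neural fields and meshes).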