Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion

Hao Wen, Zehuan Huang, Yaohui Wang, Xinyuan Chen, Yu Qiao, Lu Sheng

2024-06-05

Abstract: Existing single image-to-3D creation methods typically involve a two-stage process, first generating multi-view images, and then using these images for 3D reconstruction. However, training these two stages separately leads to significant data bias in the inference phase, thus affecting the quality of reconstructed results. We introduce a unified 3D generation framework, named Ouroboros3D, which integrates diffusion-based multi-view image generation and 3D reconstruction into a recursive diffusion process. In our framework, these two modules are jointly trained through a self-conditioning mechanism, allowing them to adapt to each other's characteristics for robust inference. During the multi-view denoising process, the multi-view diffusion model uses the 3D-aware maps rendered by the reconstruction module at the previous timestep as additional conditions. The recursive diffusion framework with 3D-aware feedback unites the entire process and improves geometric consistency. Experiments show that our framework outperforms separation of these two stages and existing methods that combine them at the inference phase. Project page: https://costwen.github.io/Ouroboros3D/

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the data bias, and the resulting loss of reconstruction quality, in existing single-image-to-3D methods, which first generate multi-view images and then reconstruct a 3D model from them. Because the two stages are trained separately, the views the reconstruction module receives at inference differ from the ones it was trained on, and this mismatch degrades the reconstructed results. The paper introduces Ouroboros3D, a unified 3D generation framework that integrates diffusion-based multi-view image generation and 3D reconstruction into a single recursive diffusion process. The two modules are trained jointly through a self-conditioning mechanism, so each adapts to the other's characteristics and inference becomes more robust. During multi-view denoising, the multi-view diffusion model takes the 3D-aware maps rendered by the reconstruction module at the previous timestep as additional conditions. This 3D-aware feedback ties the whole process together and improves geometric consistency. Experiments show that the framework outperforms both separately trained two-stage pipelines and existing methods that combine the two stages only at inference. In short, Ouroboros3D targets multi-view consistency, data bias, and reconstruction quality in single-image-to-3D generation through one recursive process.
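The recursive loop described above can be illustrated with a small sketch. This is not the authors' implementation: the functions `denoise_views`, `reconstruct_and_render`, and `recursive_sample` are hypothetical names, the "models" are toy NumPy stand-ins rather than trained networks, and the noise schedule is a simple linear interpolation chosen only to keep the example runnable. It shows only the control flow: at each step the maps rendered from the previous step's reconstruction are fed back to the multi-view denoiser as an extra condition.

```python
import numpy as np

def denoise_views(noisy_views, cond_image, feedback_maps, t):
    """Toy stand-in (assumption) for the multi-view diffusion model.

    Returns a prediction of the clean views. `t` is accepted to mirror a
    real denoiser's signature but is unused by this toy version.
    """
    pred = 0.5 * noisy_views + 0.5 * cond_image[None]
    if feedback_maps is not None:
        # Self-conditioning: mix in maps rendered at the previous step.
        pred = 0.5 * pred + 0.5 * feedback_maps
    return pred

def reconstruct_and_render(pred_views):
    """Toy stand-in (assumption) for the reconstruction module plus renderer.

    Averaging over views mimics forcing all views to share one 3D
    representation; a real system would fit a 3D model and render it.
    """
    shared = pred_views.mean(axis=0, keepdims=True)
    return np.repeat(shared, pred_views.shape[0], axis=0)

def recursive_sample(cond_image, num_views=4, num_steps=10, seed=0):
    rng = np.random.default_rng(seed)
    views = rng.standard_normal((num_views,) + cond_image.shape)
    feedback = None  # no rendered maps exist before the first step
    for step in range(num_steps, 0, -1):
        t = step / num_steps
        pred_clean = denoise_views(views, cond_image, feedback, t)
        # 3D-aware feedback that conditions the next denoising step.
        feedback = reconstruct_and_render(pred_clean)
        # Simplified re-noising toward the next, lower noise level.
        t_next = (step - 1) / num_steps
        noise = rng.standard_normal(views.shape)
        views = (1.0 - t_next) * pred_clean + t_next * noise
    return views, feedback

if __name__ == "__main__":
    image = np.ones((8, 8))
    views, maps = recursive_sample(image, num_views=4, num_steps=5)
    print(views.shape, maps.shape)
```

In a joint training setup of the kind the summary describes, both stand-ins would be learned networks optimized together, so the denoiser learns to use the rendered maps and the reconstruction module learns to handle the denoiser's intermediate predictions.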

Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models

Cycle3D: High-quality and Consistent Image-to-3D Generation via Generation-Reconstruction Cycle

RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation

One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion

Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation

Wonder3D: Single Image to 3D Using Cross-Domain Diffusion

O$^2$-Recon: Completing 3D Reconstruction of Occluded Objects in the Scene with a Pre-trained 2D Diffusion Model

Diffusion Time-step Curriculum for One Image to 3D Generation

EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior

Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

Boosting3D: High-Fidelity Image-to-3D by Boosting 2D Diffusion Prior to 3D Prior with Progressive Learning

Denoising Diffusion via Image-Based Rendering

Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of Video and Multi-view Diffusion Models

Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion

Enhanced 3D Generation by 2D Editing

Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy

3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation

Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image