Abstract: Recent 3D large reconstruction models typically employ a two-stage process: first generating multi-view images with a multi-view diffusion model, and then using a feed-forward model to reconstruct the images into 3D content. However, multi-view diffusion models often produce low-quality and inconsistent images, adversely affecting the quality of the final 3D reconstruction. To address this issue, we propose a unified 3D generation framework called Cycle3D, which cyclically utilizes a 2D diffusion-based generation module and a feed-forward 3D reconstruction module during the multi-step diffusion process. Concretely, the 2D diffusion model is applied to generate high-quality texture, and the reconstruction model guarantees multi-view consistency. Moreover, the 2D diffusion model can further control the generated content and inject reference-view information for unseen views, thereby enhancing the diversity and texture consistency of 3D generation during the denoising process. Extensive experiments demonstrate the superior ability of our method to create 3D content with high quality and consistency compared with state-of-the-art baselines.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper aims to address the issue of generating high-quality and consistent 3D assets. Specifically, existing large-scale 3D reconstruction models typically adopt a two-stage process: first, generating multi-view images through a multi-view diffusion model, and then reconstructing these images into 3D content using a feed-forward model. However, multi-view diffusion models often produce low-quality and inconsistent images, resulting in poor final 3D reconstruction quality.

To tackle these issues, the authors propose a unified 3D generation framework called Cycle3D. This framework cyclically uses a 2D diffusion generation model and a 3D reconstruction model during the multi-step diffusion process. The 2D diffusion model is responsible for generating high-quality textures, while the 3D reconstruction model ensures multi-view consistency. Additionally, the 2D diffusion model can control the content generation of unseen views and inject reference view information during the denoising process, thereby enhancing the diversity and texture consistency of the 3D generation.

### Main Contributions

1. **Unified Framework**: A unified image-to-3D generation framework, Cycle3D, is proposed, which cyclically uses a 2D diffusion model and a 3D reconstruction model during the multi-step diffusion process. The 2D diffusion model improves the quality of multi-view images, while the 3D reconstruction model enhances 3D consistency.
2. **Diversity and Texture Consistency**: The 2D diffusion model is used to control the content generation of unseen views and inject reference view information during the denoising process, thereby enhancing the diversity and texture consistency of the 3D generation.
3. **Experimental Validation**: Extensive experimental results demonstrate that this framework outperforms existing methods in the image-to-3D task, achieving high-quality and consistent 3D generation.
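The cyclic use of a 2D diffusion module and a 3D reconstruction module inside the denoising loop can be illustrated with a toy sketch. This is not the authors' implementation: `predict_clean_views`, `reconstruct_and_render`, and the linear noise schedule are hypothetical stand-ins chosen only to show the control flow, with simple array arithmetic in place of real networks. The assumption made here is that, at each step, the per-view clean estimate from the 2D module is passed through reconstruction and re-rendering before the next, less noisy state is formed.

```python
import numpy as np


def predict_clean_views(noisy_views, noise_level):
    """Hypothetical stand-in for the 2D diffusion module.

    Each view is processed independently, so nothing here enforces
    agreement between views. The toy rule just shrinks the input.
    """
    return noisy_views / (1.0 + noise_level ** 2)


def reconstruct_and_render(clean_views):
    """Hypothetical stand-in for feed-forward 3D reconstruction + rendering.

    The toy "3D asset" is the mean over views, and "rendering" returns
    that asset for every view, so the outputs agree across views.
    """
    asset = clean_views.mean(axis=0, keepdims=True)
    return np.repeat(asset, clean_views.shape[0], axis=0)


def generation_reconstruction_cycle(num_views=4, size=8, num_steps=10, seed=0):
    """Alternate the two stand-in modules at every denoising step."""
    rng = np.random.default_rng(seed)
    # Assumed schedule: noise level decreases linearly from 1 to 0.
    noise_levels = np.linspace(1.0, 0.0, num_steps + 1)
    views = rng.standard_normal((num_views, size, size)) * noise_levels[0]

    for sigma, sigma_next in zip(noise_levels[:-1], noise_levels[1:]):
        # 1) 2D module: estimate clean images for each view separately.
        x0_2d = predict_clean_views(views, sigma)
        # 2) 3D module: reconstruct and re-render to get consistent views.
        x0_3d = reconstruct_and_render(x0_2d)
        # 3) Re-noise the consistent estimate to the next (lower) noise level.
        views = x0_3d + sigma_next * rng.standard_normal(views.shape)

    return views


if __name__ == "__main__":
    final_views = generation_reconstruction_cycle()
    print(final_views.shape)
```

Because the last step adds no noise, the final views in this toy come directly from the reconstruction stand-in and therefore agree with each other; in a real system the two modules would be a pretrained 2D diffusion model and a learned reconstruction network.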

Cycle3D: High-quality and Consistent Image-to-3D Generation via Generation-Reconstruction Cycle

Chasing Consistency in Text-to-3D Generation from a Single Image

Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models

Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image

HiFi-123: Towards High-fidelity One Image to 3D Content Generation

Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion

One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion

V3D: Video Diffusion Models are Effective 3D Generators

Envision3D: One Image to 3D with Anchor Views Interpolation

Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior

Diffusion Time-step Curriculum for One Image to 3D Generation

Diffusion$^2$: Dynamic 3D Content Generation via Score Composition of Video and Multi-view Diffusion Models

Tencent Hunyuan3D-1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation

DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation

IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation

Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion

Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data

RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation