Abstract: Recent works on text-to-3D generation show that using only 2D diffusion supervision for 3D generation tends to produce results with inconsistent appearances (e.g., faces on the back view) and inaccurate shapes (e.g., animals with extra legs). Existing methods mainly address this issue by retraining diffusion models with images rendered from 3D data to ensure multi-view consistency, but they struggle to balance 2D generation quality with 3D consistency. In this paper, we present a new framework, Sculpt3D, that equips the current pipeline with explicit injection of 3D priors from retrieved reference objects without retraining the 2D diffusion model. Specifically, we demonstrate that high-quality and diverse 3D geometry can be guaranteed by keypoint supervision through a sparse ray sampling approach. Moreover, to ensure accurate appearances across different views, we further modulate the output of the 2D diffusion model toward the correct patterns of the template views without altering the generated object's style. These two decoupled designs effectively harness 3D information from reference objects to generate 3D objects while preserving the generation quality of the 2D diffusion model. Extensive experiments show that our method can largely improve multi-view consistency while retaining fidelity and diversity. Our project page is available at:

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to solve the problems of **multi-view consistency and shape accuracy in text-to-3D generation**. Specifically, when only a 2D diffusion model supervises 3D generation, existing methods often produce results with the following problems:

1. **Inconsistent appearance**: For example, a face is generated in the rear view.
2. **Inaccurate shape**: For example, an animal is generated with extra legs.

To solve these problems, the authors propose a new framework, **Sculpt3D**, which explicitly injects 3D prior information obtained from retrieved reference objects without retraining the 2D diffusion model. Specific contributions include:

- **Explicit integration of geometric and appearance information**: Sparse ray sampling and template appearance modulation ensure that the generated 3D objects have consistent and accurate appearance and shape across multiple views.
- **Creative point growth and pruning**: During co-supervision by 2D diffusion and 3D geometry, the 2D diffusion model remains free to generate shapes that are both accurate and creative.
- **Improved multi-view consistency**: Experiments show that this method can significantly improve the multi-view consistency of text-to-3D generation while maintaining the diversity and fidelity of the generated results.

### Main challenges

1. **Balance of shape constraints**: Constraints that are too strict make the generated results too close to the template, while constraints that are too loose cannot guarantee a reasonable shape.
2. **Appearance consistency**: Even if the shape is accurate, the model may still generate the wrong appearance in some difficult cases.

### Solutions

1. **Sparse ray sampling technique**: Supervise only a small number of keypoints that describe the overall structure, so that the 2D diffusion model can generate freely in the regions left unconstrained.
2. **Template appearance modulation**: Use a lightweight adapter to convert the template appearance into a form that matches the style of the generated object, correcting the appearance of the generated object without affecting its overall style.

Through these designs, Sculpt3D can significantly improve the multi-view consistency and accuracy of generated 3D objects while maintaining the high-quality generation ability of the 2D diffusion model.
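To make the idea of supervising only a few keypoints concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the summary does not say how keypoints are selected, so greedy farthest point sampling is swapped in as an assumption, and the function names (`farthest_point_sampling`, `project_to_pixels`, `sparse_ray_mask`), the pinhole camera at the origin looking down +z, and all parameter values are hypothetical. The sketch picks a sparse set of points on a template shape, projects them into one view, and marks the few pixels whose rays would receive geometric supervision; every other pixel stays unconstrained.

```python
import numpy as np


def farthest_point_sampling(points: np.ndarray, k: int, start: int = 0) -> np.ndarray:
    """Greedily pick k indices so the chosen points spread over the cloud.

    Assumption: stands in for whatever keypoint selection the paper uses.
    """
    n = points.shape[0]
    k = min(k, n)
    chosen = np.empty(k, dtype=np.int64)
    chosen[0] = start
    # Distance from every point to its nearest already-chosen point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, k):
        chosen[i] = int(np.argmax(min_dist))
        dist_to_new = np.linalg.norm(points - points[chosen[i]], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return chosen


def project_to_pixels(points: np.ndarray, focal: float, width: int, height: int) -> np.ndarray:
    """Pinhole projection; hypothetical camera at the origin looking down +z.

    All points are assumed to have z > 0 (in front of the camera).
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    u = focal * x / z + width / 2.0
    v = focal * y / z + height / 2.0
    return np.stack([u, v], axis=1)


def sparse_ray_mask(template_points: np.ndarray, k: int, focal: float,
                    width: int, height: int):
    """Return keypoint indices and a boolean image mask of supervised rays."""
    idx = farthest_point_sampling(template_points, k)
    uv = np.rint(project_to_pixels(template_points[idx], focal, width, height)).astype(int)
    inside = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width) &
        (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    mask = np.zeros((height, width), dtype=bool)
    mask[uv[inside, 1], uv[inside, 0]] = True  # rows are v, columns are u
    return idx, mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for a retrieved template shape, placed in front of the camera.
    template = rng.uniform([-0.5, -0.5, 2.0], [0.5, 0.5, 4.0], size=(2000, 3))
    idx, mask = sparse_ray_mask(template, k=32, focal=64.0, width=64, height=64)
    print(f"supervised rays: {mask.sum()} of {mask.size} pixels")
```

The value of `k` plays the role of the constraint-strength knob described under "Main challenges": under the assumptions above, a larger `k` ties more rays to the template, while a smaller `k` leaves more of the image to the 2D diffusion model.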

Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior

Chasing Consistency in Text-to-3D Generation from a Single Image

Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior

IT3D: Improved Text-to-3D Generation with Explicit View Synthesis

EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior

SweetDreamer: Aligning Geometric Priors in 2D Diffusion for Consistent Text-to-3D

GeoDream: Disentangling 2D and Geometric Priors for High-Fidelity and Consistent 3D Generation

Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

SeMv-3D: Towards Semantic and Mutil-view Consistency simultaneously for General Text-to-3D Generation with Triplane Priors

Learning Pseudo 3D Guidance for View-consistent Texturing with 2D Diffusion

Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting

UniDream: Unifying Diffusion Priors for Relightable Text-to-3D Generation

Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion

3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation

Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views

Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion

VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation

Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation

Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior

Points-to-3D: Bridging the Gap between Sparse Points and Shape-Controllable Text-to-3D Generation

3DDesigner: Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion Models