Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior

Cheng Chen, Xiaofeng Yang, Fan Yang, Chengzeng Feng, Zhoujie Fu, Chuan-Sheng Foo, Guosheng Lin, Fayao Liu
2024-03-14
Abstract: Recent works on text-to-3D generation show that using only 2D diffusion supervision for 3D generation tends to produce results with inconsistent appearances (e.g., faces on the back view) and inaccurate shapes (e.g., animals with extra legs). Existing methods mainly address this issue by retraining diffusion models with images rendered from 3D data to ensure multi-view consistency while struggling to balance 2D generation quality with 3D consistency. In this paper, we present a new framework Sculpt3D that equips the current pipeline with explicit injection of 3D priors from retrieved reference objects without re-training the 2D diffusion model. Specifically, we demonstrate that high-quality and diverse 3D geometry can be guaranteed by keypoints supervision through a sparse ray sampling approach. Moreover, to ensure accurate appearances of different views, we further modulate the output of the 2D diffusion model to the correct patterns of the template views without altering the generated object's style. These two decoupled designs effectively harness 3D information from reference objects to generate 3D objects while preserving the generation quality of the 2D diffusion model. Extensive experiments show our method can largely improve the multi-view consistency while retaining fidelity and diversity. Our project page is available at:
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses **multi-view consistency and shape accuracy in text-to-3D generation**. When 3D generation is supervised only by a 2D diffusion model, existing methods often produce results with two problems:

1. **Inconsistent appearance**: for example, a face is generated in the rear view.
2. **Inaccurate shape**: for example, an animal is generated with extra legs.

To solve these problems, the authors propose a new framework, **Sculpt3D**, which explicitly injects 3D priors obtained from retrieved reference objects without retraining the 2D diffusion model. Its specific contributions are:

- **Explicit integration of geometric and appearance information**: sparse ray sampling and template appearance modulation keep the shape and appearance of the generated 3D object consistent and accurate across views.
- **Creative point growth and pruning**: during co-supervision by 2D diffusion and 3D geometry, the 2D diffusion model can still produce shapes that are both accurate and creative.
- **Improved multi-view consistency**: experiments show that the method significantly improves the multi-view consistency of text-to-3D generation while preserving the fidelity and diversity of the results.

### Main challenges

1. **Balance of shape constraints**: constraints that are too strict make the result too close to the template, while constraints that are too loose cannot guarantee a reasonable shape.
2. **Appearance consistency**: even when the shape is accurate, the model may still generate the wrong appearance in some difficult cases.

### Solutions

1. **Sparse ray sampling technique**: supervise only a small number of keypoints that describe the overall structure, leaving the 2D diffusion model free to use its generative ability in the unconstrained space.
2. **Template appearance modulation**: use a lightweight adapter to convert the template appearance into a form that matches the style of the generated object, correcting the object's appearance without affecting its overall style.

Through these designs, Sculpt3D significantly improves the multi-view consistency and accuracy of generated 3D objects while preserving the high-quality generation ability of the 2D diffusion model.
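The sparse ray sampling idea can be illustrated with a small NumPy sketch. This is not the authors' implementation; it is a toy version under stated assumptions. The function names (`farthest_point_sampling`, `keypoint_rays`, `sparse_occupancy_loss`), the use of farthest point sampling to pick keypoints, the hinge-style loss, and the `threshold` parameter are all hypothetical choices made for illustration. The point it shows is that only a handful of rays, one per keypoint of the reference shape, constrain the density field, so the rest of space remains unconstrained.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Pick k well-spread indices from an (N, 3) point cloud (assumed keypoint selector)."""
    chosen = [start]
    dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))  # point farthest from everything chosen so far
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def keypoint_rays(camera_origin, keypoints):
    """Return one unit ray direction per keypoint, plus the keypoint's depth along that ray."""
    offsets = keypoints - camera_origin
    depths = np.linalg.norm(offsets, axis=1)
    directions = offsets / depths[:, None]
    return directions, depths

def sparse_occupancy_loss(density_fn, camera_origin, keypoints, threshold=1.0):
    """Hinge penalty (hypothetical) when density at a keypoint's depth falls below threshold."""
    directions, depths = keypoint_rays(camera_origin, keypoints)
    # Evaluate each sparse ray at the depth where it hits its keypoint.
    samples = camera_origin + directions * depths[:, None]
    density = density_fn(samples)
    return float(np.mean(np.maximum(0.0, threshold - density)))

# Toy reference shape: points on a sphere of radius 0.9.
rng = np.random.default_rng(0)
template = rng.normal(size=(500, 3))
template = 0.9 * template / np.linalg.norm(template, axis=1, keepdims=True)

keypoints = template[farthest_point_sampling(template, k=16)]
camera = np.array([0.0, 0.0, 3.0])

# Two toy density fields: a solid unit ball, and empty space.
solid_ball = lambda p: 5.0 * (np.linalg.norm(p, axis=1) <= 1.0)
empty_space = lambda p: np.zeros(len(p))

loss_ball = sparse_occupancy_loss(solid_ball, camera, keypoints)
loss_empty = sparse_occupancy_loss(empty_space, camera, keypoints)
```

In this sketch the field that already covers the keypoints incurs no penalty, while the empty field is penalized at every keypoint. Only 16 of the 500 template points are supervised, which mirrors the trade-off described above: enough constraint to fix the overall structure, little enough to leave the diffusion model room to shape the details.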