VolumeDiffusion: Flexible Text-to-3D Generation with Efficient Volumetric Encoder

Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, Baining Guo
2024-08-13
Abstract: This paper introduces a pioneering 3D volumetric encoder designed for text-to-3D generation. To scale up the training data for the diffusion model, a lightweight network is developed to efficiently acquire feature volumes from multi-view images. The 3D volumes are then trained on a diffusion model for text-to-3D generation using a 3D U-Net. This research further addresses the challenges of inaccurate object captions and high-dimensional feature volumes. The proposed model, trained on the public Objaverse dataset, demonstrates promising outcomes in producing diverse and recognizable samples from text prompts. Notably, it empowers finer control over object part characteristics through textual cues, fostering model creativity by seamlessly combining multiple concepts within a single object. This research significantly contributes to the progress of 3D generation by introducing an efficient, flexible, and scalable representation methodology.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper proposes a new approach to text-to-3D generation: an efficient 3D volumetric encoder paired with a diffusion model that generates 3D objects from text prompts.

### Problems the Paper Attempts to Solve

1. **Efficient Data Representation**: Existing 3D representations, such as Tri-plane and Implicit Neural Representations (INRs), are either inefficient on large-scale datasets or hard to control at a fine level through text prompts. The paper proposes a volumetric representation that captures feature volumes efficiently from multi-view images and interacts flexibly with text prompts.
2. **Expanding Training Data**: The diffusion model needs a large amount of training data to perform well. The paper proposes a lightweight network that quickly extracts feature volumes from multi-view images, which makes it practical to scale up the training set.
3. **Handling High-Dimensional Feature Volumes**: The feature volumes are very high-dimensional, which makes the diffusion model difficult to train. The paper designs a new noise schedule and a low-frequency noise strategy to handle the information in these volumes.
4. **Inaccurate Object Descriptions**: Object captions in existing datasets are often inaccurate, which can destabilize training. The paper mitigates this with a new filtering scheme.
5. **Creative Design Capability**: The method gives text prompts finer control over the characteristics of individual object parts, which supports creative design, i.e., combining multiple concepts in a single object.
6. **Model Controllability and Diversity**: The paper also aims to control different parts of a 3D object precisely through text prompts while still generating diverse samples.
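The low-frequency noise strategy in item 3 is named but not specified in this summary. As a rough illustration only, the sketch below shows one common way to build such noise: ordinary per-voxel Gaussian noise is blended with a second Gaussian sample that is shared by every voxel in a channel. The function name `mixed_noise`, the blend weight `alpha`, and the `(C, D, H, W)` layout are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def mixed_noise(shape, alpha, rng):
    """Blend per-voxel noise with per-channel shared (low-frequency) noise.

    shape: (C, D, H, W) layout of the feature volume (assumed for this sketch).
    alpha: hypothetical blend weight in [0, 1]; 0 gives plain i.i.d. noise,
           1 gives noise that is constant within each channel.
    """
    channels = shape[0]
    # Independent Gaussian sample for every voxel (high-frequency part).
    eps_iid = rng.standard_normal(shape)
    # One Gaussian sample per channel, broadcast over D, H, W (low-frequency part).
    eps_shared = rng.standard_normal((channels,) + (1,) * (len(shape) - 1))
    # The square-root weights keep each element at unit variance:
    # (1 - alpha) + alpha = 1.
    return np.sqrt(1.0 - alpha) * eps_iid + np.sqrt(alpha) * eps_shared

rng = np.random.default_rng(0)
noise = mixed_noise((4, 8, 8, 8), alpha=0.5, rng=rng)
```

In this construction any two voxels in the same channel have correlation `alpha`, so the noise corrupts the channel-wide (low-frequency) content as well as the per-voxel detail. Whether the paper uses this exact weighting is not stated here.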
### Summary

This research aims to overcome the limitations of existing text-to-3D methods by proposing a novel 3D volumetric representation and a corresponding diffusion model, with the goal of more efficient, flexible, and controllable 3D object generation. The paper also addresses three practical obstacles: insufficient training data, the difficulty of training on high-dimensional feature volumes, and inaccurate object descriptions. Addressing these is intended to improve the quality and diversity of the generated results.