Abstract: This paper introduces a pioneering 3D volumetric encoder designed for text-to-3D generation. To scale up the training data for the diffusion model, a lightweight network is developed to efficiently acquire feature volumes from multi-view images. The 3D volumes are then trained on a diffusion model for text-to-3D generation using a 3D U-Net. This research further addresses the challenges of inaccurate object captions and high-dimensional feature volumes. The proposed model, trained on the public Objaverse dataset, demonstrates promising outcomes in producing diverse and recognizable samples from text prompts. Notably, it empowers finer control over object part characteristics through textual cues, fostering model creativity by seamlessly combining multiple concepts within a single object. This research significantly contributes to the progress of 3D generation by introducing an efficient, flexible, and scalable representation methodology.

What problem does this paper attempt to address?

The main goal of this paper is to propose a new method for text-to-3D generation, specifically an efficient 3D voxel encoder and a corresponding diffusion model for high-quality 3D object generation.

### Problems the Paper Attempts to Solve

1. **Efficient Data Representation**: Existing 3D representation methods, such as Tri-plane and Implicit Neural Representations (INRs), are inefficient on large-scale datasets or hard to control finely through text prompts. The paper aims to propose a new 3D voxel representation that can efficiently capture feature volumes from multi-view images and interact flexibly with text prompts.
2. **Expanding Training Data**: Improving the performance of the diffusion model requires a large amount of training data. The paper proposes a lightweight network that quickly extracts feature volumes from multi-view images, which makes it practical to scale up the training dataset.
3. **Handling High-Dimensional Feature Volumes**: The feature volumes are very high-dimensional, which makes training the diffusion model challenging. The paper designs a new noise scheduling strategy and a low-frequency noise strategy to handle the information in high-dimensional feature volumes effectively.
4. **Inaccurate Object Descriptions**: Object descriptions in existing datasets are often inaccurate, which can lead to unstable training. The paper mitigates this by designing a new caption filtering scheme.
5. **Creative Design Capability**: The proposed method lets the model control the characteristics of object parts through text prompts, which supports creative design, i.e., integrating multiple concepts into one object.
6. **Model Controllability and Diversity**: The paper also aims to control different parts of the 3D object precisely through text prompts and to generate diverse samples.
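
To make the "low-frequency noise" idea in item 3 more concrete, here is a minimal sketch of one simple way such noise could be built: ordinary per-voxel Gaussian noise is mixed with a single Gaussian draw shared by every voxel in a channel. This summary does not give the paper's formula, so the mixing rule, the weight `alpha`, and the function names `low_frequency_noise` and `add_noise` are all assumptions made for illustration. The feature volume is flattened to `num_channels` lists of `num_voxels` values to keep the example dependency-free.

```python
import math
import random


def low_frequency_noise(num_channels, num_voxels, alpha, rng):
    """Mix per-voxel Gaussian noise with one shared Gaussian draw per channel.

    alpha in [0, 1] is a hypothetical mixing weight (an assumption, not a value
    from the paper): 0 gives ordinary i.i.d. noise, 1 gives noise that is
    constant within each channel. Each value keeps unit variance.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    local_scale = math.sqrt(1.0 - alpha)
    shared_scale = math.sqrt(alpha)
    noise = []
    for _ in range(num_channels):
        shared = rng.gauss(0.0, 1.0)  # one draw reused by every voxel in this channel
        noise.append([
            local_scale * rng.gauss(0.0, 1.0) + shared_scale * shared
            for _ in range(num_voxels)
        ])
    return noise


def add_noise(volume, noise, signal_level):
    """Standard forward diffusion step: sqrt(a) * x + sqrt(1 - a) * noise."""
    a = math.sqrt(signal_level)
    b = math.sqrt(1.0 - signal_level)
    return [[a * x + b * n for x, n in zip(xs, ns)] for xs, ns in zip(volume, noise)]


# Usage: corrupt an all-zero toy volume with 2 channels of 8 x 8 x 8 voxels.
rng = random.Random(0)
side = 8
voxels = side ** 3
volume = [[0.0] * voxels for _ in range(2)]
noise = low_frequency_noise(num_channels=2, num_voxels=voxels, alpha=0.5, rng=rng)
noisy = add_noise(volume, noise, signal_level=0.9)
```

Because the shared term shifts a whole channel at once, this kind of noise also perturbs the channel's mean rather than only its fine detail. Whether this matches the paper's exact construction should be checked against the paper itself.
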
### Summary

This research aims to overcome the limitations of existing technologies in the text-to-3D generation task by proposing a novel 3D voxel representation method and a corresponding diffusion model, achieving more efficient, flexible, and controllable 3D object generation. Additionally, the paper addresses key issues such as insufficient training data, difficulties in handling high-dimensional feature volumes, and inaccurate object descriptions, thereby significantly improving the quality and diversity of the generated results.

VolumeDiffusion: Flexible Text-to-3D Generation with Efficient Volumetric Encoder

Diffusion-SDF: Text-to-Shape Via Voxelized Diffusion

Vox-E: Text-guided Voxel Editing of 3D Objects

Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation

PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion

Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting

Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion

Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

ET3D: Efficient Text-to-3D Generation via Multi-View Distillation

EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior

ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

Text-Image Conditioned Diffusion for Consistent Text-to-3D Generation

VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation

Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior

VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation

GVGEN: Text-to-3D Generation with Volumetric Representation

Unleashing Text-to-Image Diffusion Models for Visual Perception

3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors

IT3D: Improved Text-to-3D Generation with Explicit View Synthesis

Diverse and Stable 2D Diffusion Guided Text to 3D Generation with Noise Recalibration