Abstract: Despite the latest remarkable advances in generative modeling, efficient generation of high-quality 3D assets from textual prompts remains a difficult task. A key challenge lies in data scarcity: the most extensive 3D datasets encompass merely millions of assets, while their 2D counterparts contain billions of text-image pairs. To address this, we propose a novel approach which harnesses the power of large, pretrained 2D diffusion models. More specifically, our approach, HexaGen3D, fine-tunes a pretrained text-to-image model to jointly predict 6 orthographic projections and the corresponding latent triplane. We then decode these latents to generate a textured mesh. HexaGen3D does not require per-sample optimization, and can infer high-quality and diverse objects from textual prompts in 7 seconds, offering significantly better quality-to-latency trade-offs when compared to existing approaches. Furthermore, HexaGen3D demonstrates strong generalization to new objects or compositions.
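
The "6 orthographic projections" mentioned in the abstract can be pictured with a small sketch. The code below is only an illustration of what six axis-aligned orthographic views of a 3D point set look like; the view names, axis conventions, and the helpers `project` and `hexaview` are assumptions made for this example and are not taken from the paper or its implementation.

```python
# Illustrative sketch only: six axis-aligned orthographic views of 3D points.
# View names and axis conventions are assumptions, not the paper's definitions.

AXES = {"x": 0, "y": 1, "z": 2}

# Each view looks along one axis; the sign selects which side the camera is on.
VIEWS = {
    "front": ("z", +1),
    "back": ("z", -1),
    "right": ("x", +1),
    "left": ("x", -1),
    "top": ("y", +1),
    "bottom": ("y", -1),
}


def project(point, view):
    """Orthographically project a 3D point onto the image plane of `view`.

    An orthographic projection simply drops the coordinate along the viewing
    axis, so depth has no effect on where the point lands in the image.
    """
    axis_name, sign = VIEWS[view]
    axis = AXES[axis_name]
    u, v = [c for i, c in enumerate(point) if i != axis]
    # Mirror the horizontal coordinate when looking from the opposite side.
    return (sign * u, v)


def hexaview(points):
    """Return the six 2D projections of a list of 3D points, keyed by view."""
    return {name: [project(p, name) for p in points] for name in VIEWS}


if __name__ == "__main__":
    for name, pts in hexaview([(1.0, 2.0, 3.0)]).items():
        print(name, pts)
```

Because the depth coordinate is discarded, two points that differ only along the viewing axis land on the same pixel. Each single view is therefore ambiguous, which is one way to see why several complementary views are useful.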

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the problem of efficiently generating high-quality 3D assets from text prompts. Specifically, the authors point out the following key challenges:

1. **Data scarcity**: 3D datasets are much smaller than 2D image datasets. The most extensive 3D datasets contain only millions of assets, while 2D datasets (such as LAION-5B) contain billions of text-image pairs. This leaves 3D generation models with insufficient training data.
2. **Long generation time**: Existing methods based on Score Distillation Sampling (SDS), such as DreamFusion, can generate high-quality 3D assets, but generation is slow, usually taking from 20 minutes to several hours.
3. **Lack of diversity**: The 3D assets generated by existing methods vary little across random seeds and tend to look alike.

To address these problems, the authors propose HexaGen3D, a new text-to-3D generation model. It improves generation efficiency and quality in the following ways:

- **Utilizing pre-trained 2D diffusion models**: HexaGen3D takes a pre-trained text-to-image model and fine-tunes it to predict six orthographic views and the corresponding triplane latent representation. These latents are then decoded into a textured mesh.
- **Fast generation**: HexaGen3D generates high-quality and diverse 3D objects in 7 seconds, greatly shortening generation time.
- **Strong generalization ability**: The model handles a wide range of text prompts and generalizes to new objects or compositions not seen during training.

With these improvements, HexaGen3D is far faster than existing methods while also performing well on quality and diversity, providing a more efficient and practical solution for 3D asset generation.

### Key contributions

1. **Introducing the "Orthogonal Six-View Guidance" technique**: Bridges 2D and 3D synthesis tasks by predicting six orthographic views.
2. **Efficient feed-forward generation**: Generates 3D assets in a feed-forward pass, without per-sample optimization.
3. **Significantly improved generation speed**: HexaGen3D is orders of magnitude faster than existing methods (7 seconds versus 20 minutes to several hours).

### Experimental results

Experiments show that HexaGen3D outperforms other methods on multiple metrics, including generation time and visual quality. Specifically:

- **Generation time**: HexaGen3D takes only 7 seconds, while MVDream takes about 3 hours.
- **Visual quality**: A user preference survey shows that HexaGen3D scores higher in visual quality and text-prompt fidelity.
- **Diversity**: The 3D objects generated by HexaGen3D are more diverse.

In summary, HexaGen3D provides a faster, higher-quality, and more diverse text-to-3D generation method, addressing the shortcomings of existing methods in generation time and diversity.
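
To make the speed comparison concrete, the short calculation below converts the timings quoted in this summary (7 seconds for HexaGen3D, about 3 hours for MVDream, and 20 minutes as the lower end of the SDS range) into speedup factors. It is a back-of-the-envelope sketch: the constant and function names are hypothetical, and the inputs are the rounded figures above rather than measured benchmarks.

```python
import math

# Timings as quoted in the summary above (rounded figures, not benchmarks).
HEXAGEN3D_SECONDS = 7
MVDREAM_SECONDS = 3 * 60 * 60   # "about 3 hours"
SDS_LOW_END_SECONDS = 20 * 60   # lower end of "20 minutes to several hours"


def speedup(baseline_seconds, method_seconds):
    """How many times faster the method is than the baseline."""
    return baseline_seconds / method_seconds


def orders_of_magnitude(ratio):
    """Express a speedup ratio as a base-10 exponent."""
    return math.log10(ratio)


if __name__ == "__main__":
    for label, baseline in [
        ("MVDream (~3 h)", MVDREAM_SECONDS),
        ("SDS low end (20 min)", SDS_LOW_END_SECONDS),
    ]:
        ratio = speedup(baseline, HEXAGEN3D_SECONDS)
        print(f"{label}: {ratio:.0f}x faster, "
              f"about {orders_of_magnitude(ratio):.1f} orders of magnitude")
```

With these inputs the speedup is roughly 1,500x against a 3-hour run and roughly 170x against a 20-minute run, which corresponds to about two to three orders of magnitude.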

HexaGen3D: StableDiffusion is just one step away from Fast and Diverse Text-to-3D Generation

Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion

3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors

PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion

ET3D: Efficient Text-to-3D Generation via Multi-View Distillation

EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior

GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models

DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

3DDesigner: Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion Models

GenesisTex2: Stable, Consistent and High-Quality Text-to-Texture Generation

VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation

3DGen: Triplane Latent Diffusion for Textured Mesh Generation

Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy

Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation

Geometry Image Diffusion: Fast and Data-Efficient Text-to-3D with Image-Based Surface Representation

3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation

Diverse and Stable 2D Diffusion Guided Text to 3D Generation with Noise Recalibration

DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation

Guide3D: Create 3D Avatars from Text and Image Guidance

GradeADreamer: Enhanced Text-to-3D Generation Using Gaussian Splatting and Multi-View Diffusion

Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model