HexaGen3D: StableDiffusion is just one step away from Fast and Diverse Text-to-3D Generation

Antoine Mercier, Ramin Nakhli, Mahesh Reddy, Rajeev Yasarla, Hong Cai, Fatih Porikli, Guillaume Berger
2024-01-15
Abstract: Despite the latest remarkable advances in generative modeling, efficient generation of high-quality 3D assets from textual prompts remains a difficult task. A key challenge lies in data scarcity: the most extensive 3D datasets encompass merely millions of assets, while their 2D counterparts contain billions of text-image pairs. To address this, we propose a novel approach which harnesses the power of large, pretrained 2D diffusion models. More specifically, our approach, HexaGen3D, fine-tunes a pretrained text-to-image model to jointly predict 6 orthographic projections and the corresponding latent triplane. We then decode these latents to generate a textured mesh. HexaGen3D does not require per-sample optimization, and can infer high-quality and diverse objects from textual prompts in 7 seconds, offering significantly better quality-to-latency trade-offs when comparing to existing approaches. Furthermore, HexaGen3D demonstrates strong generalization to new objects or compositions.
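To make the phrase "6 orthographic projections" concrete, here is a minimal geometric sketch in NumPy. It is not the authors' rendering or training pipeline: it only shows what six axis-aligned orthographic views of a shape are, using a toy point cloud and depth maps. The names `VIEWS` and `orthographic_views`, the view naming, and the depth-map encoding are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical view table: name -> (axis the camera looks along, side the camera sits on).
VIEWS = {
    "front": (2, +1), "back": (2, -1),
    "right": (0, +1), "left": (0, -1),
    "top": (1, +1), "bottom": (1, -1),
}

def orthographic_views(points, resolution=32):
    """Rasterize points in [-1, 1]^3 into six axis-aligned orthographic depth maps.

    Each map holds, per pixel, the depth of the nearest point as seen from that
    side (np.inf where no point projects). No mirroring is applied between
    opposite views.
    """
    points = np.asarray(points, dtype=float)
    views = {}
    for name, (axis, side) in VIEWS.items():
        kept = [a for a in range(3) if a != axis]  # the two in-plane axes
        # Orthographic projection: drop the view axis, map [-1, 1] to pixel indices.
        uv = np.rint((points[:, kept] + 1.0) / 2.0 * (resolution - 1)).astype(int)
        uv = np.clip(uv, 0, resolution - 1)
        # Depth grows with distance from the camera on the chosen side.
        depth = -side * points[:, axis]
        image = np.full((resolution, resolution), np.inf)
        np.minimum.at(image, (uv[:, 1], uv[:, 0]), depth)  # keep nearest point per pixel
        views[name] = image
    return views

# Usage: random points on a sphere of radius 0.5.
rng = np.random.default_rng(0)
p = rng.normal(size=(2000, 3))
p = 0.5 * p / np.linalg.norm(p, axis=1, keepdims=True)
hexaview = orthographic_views(p)
print({name: img.shape for name, img in hexaview.items()})
```

Because the projection is orthographic, each view simply drops one coordinate. This direct link between pixels and 3D positions is what makes such views a natural partner for an axis-aligned triplane representation.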
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the problem of efficiently generating high-quality 3D assets from text prompts. The authors identify the following key challenges:

1. **Data scarcity**: 3D datasets are far smaller than their 2D counterparts. The most extensive 3D datasets contain only millions of assets, while 2D datasets (such as LAION-5B) contain billions of text-image pairs. This leaves 3D generation models short of training data.
2. **Long generation time**: Existing methods based on Score Distillation Sampling (SDS), such as DreamFusion, can generate high-quality 3D assets, but generation is slow, usually taking from 20 minutes to several hours per asset.
3. **Lack of diversity**: The 3D assets generated by existing methods vary little across random seeds and tend to look alike.

To address these problems, the authors propose HexaGen3D, a new text-to-3D generation model. It improves generation efficiency and quality in the following ways:

- **Utilizing pretrained 2D diffusion models**: HexaGen3D fine-tunes a pretrained text-to-image model to jointly predict six orthographic projections and the corresponding latent triplane. These latents are then decoded into a textured mesh.
- **Fast generation**: HexaGen3D generates high-quality and diverse 3D objects in 7 seconds, without per-sample optimization.
- **Strong generalization ability**: The model handles a wide range of text prompts and generalizes to new objects or compositions not seen during training.

With these improvements, HexaGen3D is far faster than existing methods while remaining competitive in quality and diversity, which makes it a more efficient and practical solution for 3D asset generation.

### Key contributions

1. **Introducing the "Orthographic Hexaview guidance" technique**: Bridges 2D and 3D synthesis by having the model predict six orthographic views.
2. **Efficient feed-forward generation**: Generates 3D assets in a feed-forward manner, without per-sample optimization.
3. **Significantly improved generation speed**: Compared with the optimization-based methods above, HexaGen3D is roughly two to three orders of magnitude faster (7 seconds versus 20 minutes to several hours).

### Experimental results

Experiments show that HexaGen3D outperforms other methods on multiple metrics, including generation time and visual quality. Specifically:

- **Generation time**: HexaGen3D takes only 7 seconds, while MVDream takes about 3 hours.
- **Visual quality**: A user preference survey shows that HexaGen3D scores higher in visual quality and text-prompt fidelity.
- **Diversity**: The 3D objects generated by HexaGen3D are more diverse.

In summary, HexaGen3D provides a faster, higher-quality, and more diverse text-to-3D generation method, addressing the shortcomings of existing methods in generation time and diversity.
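The latency figures quoted in this summary (7 seconds for HexaGen3D, at least 20 minutes for SDS-based methods, about 3 hours for MVDream) can be turned into explicit speedup factors. The short sketch below does only that arithmetic. The helper name `speedup` and the dictionary labels are illustrative, and the baseline timings are the approximate values stated above, not new measurements.

```python
import math

HEXAGEN3D_SECONDS = 7  # generation time reported for HexaGen3D

# Approximate baseline timings as quoted in the summary, converted to seconds.
baselines = {
    "SDS-based, lower bound (20 min)": 20 * 60,
    "MVDream (about 3 h)": 3 * 3600,
}

def speedup(baseline_seconds, method_seconds=HEXAGEN3D_SECONDS):
    """Return how many times faster the method is than the baseline."""
    return baseline_seconds / method_seconds

for name, seconds in baselines.items():
    factor = speedup(seconds)
    print(f"{name}: {factor:.0f}x faster (10^{math.log10(factor):.1f})")
```

Under these figures the speedup is roughly 170x against the fastest SDS-based runs and about 1,500x against MVDream, that is, between two and three orders of magnitude.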