Abstract: Recent breakthroughs in text-to-image generation have shown encouraging results via large generative models. Due to the scarcity of 3D assets, it is hard to transfer the success of text-to-image generation to text-to-3D generation. Existing text-to-3D generation methods usually adopt the paradigm of DreamFusion, which conducts per-asset optimization by distilling a pre-trained text-to-image diffusion model. Generation usually takes from several minutes to tens of minutes per 3D asset, which degrades the user experience and also imposes a burden on service providers due to the high computational budget. In this work, we present an efficient text-to-3D generation method, which requires only around 8 ms to generate a 3D asset given the text prompt on a consumer graphics card. The main insight is that we exploit the images generated by a large pre-trained text-to-image diffusion model to supervise the training of a text-conditioned 3D generative adversarial network. Once the network is trained, we are able to efficiently generate a 3D asset via a single forward pass. Our method requires no 3D training data and provides an alternative approach for efficient text-to-3D generation by distilling pre-trained image diffusion models.

What problem does this paper attempt to address?

The paper addresses the inefficiency and high compute cost of existing text-to-3D generation methods. Current methods usually follow the DreamFusion paradigm: they optimize each 3D asset separately by distilling knowledge from a pre-trained text-to-image diffusion model. Generating a single asset this way typically takes from several minutes to tens of minutes, which degrades the user experience and imposes a high computing cost on service providers. To address this, the paper proposes ET3D, an efficient text-to-3D generation method. Its core idea is to use images generated by a large pre-trained text-to-image diffusion model to supervise the training of a text-conditioned 3D generative adversarial network (GAN). Once the network is trained, a 3D asset can be generated in a single forward pass, taking only about 8 milliseconds. The method requires no 3D training data and offers a new, efficient route to text-to-3D generation by distilling the knowledge of a pre-trained image diffusion model.
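To make the adversarial supervision described above more concrete, here is a minimal sketch of a GAN objective in which the "real" samples are images from the pre-trained diffusion model and the "fake" samples are renders of the generated 3D asset. The summary does not say which loss ET3D uses; the non-saturating logistic loss below is an assumption chosen because it is a common GAN formulation, and the function names are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def discriminator_loss(real_logits, fake_logits):
    """Non-saturating logistic discriminator loss (assumed, not from the paper).

    real_logits: discriminator scores for teacher images, i.e. images
                 produced by the pre-trained diffusion model.
    fake_logits: discriminator scores for renders of the 3D asset
                 produced by the text-conditioned generator.
    """
    real_term = sum(softplus(-r) for r in real_logits) / len(real_logits)
    fake_term = sum(softplus(f) for f in fake_logits) / len(fake_logits)
    return real_term + fake_term


def generator_loss(fake_logits):
    """Generator is rewarded when its renders are scored as real."""
    return sum(softplus(-f) for f in fake_logits) / len(fake_logits)


# Toy scores: the discriminator is confident about both kinds of sample.
d_loss = discriminator_loss(real_logits=[2.0, 3.0], fake_logits=[-2.0, -1.0])
g_loss = generator_loss(fake_logits=[-2.0, -1.0])
print(f"discriminator loss: {d_loss:.4f}, generator loss: {g_loss:.4f}")
```

In a real training loop the logits would come from a discriminator network conditioned on the text prompt, and gradients of these losses would update the discriminator and generator in alternation; none of that machinery is shown here.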
The main contributions of the paper include:
- Proposing a simple and effective text-conditioned 3D generative adversarial network;
- Training the network by distilling multi-view knowledge from a pre-trained large-scale text-to-multi-view-image generation model, without using an SDS loss or any 3D datasets;
- Generating 3D assets on consumer-grade graphics cards in only about 8 milliseconds after training, which significantly reduces the computing budget and provides a real-time experience;
- Demonstrating that an efficient general-purpose text-to-3D generation model can be trained by relying on a pre-trained large-scale text-to-multi-view-image diffusion model;
- Emphasizing the importance of exploring efficient text-to-3D content generation by leveraging pre-trained text-to-multi-view base models.
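The 8 ms figure can be put in perspective with a short calculation. The sketch below converts the latency into throughput and compares it with per-asset optimization. The baseline times of 5 and 30 minutes are assumptions standing in for "several minutes to tens of minutes"; they are not measurements from the paper.

```python
ET3D_LATENCY_MS = 8.0  # per-asset latency quoted in the summary


def assets_per_second(latency_ms: float) -> float:
    """Throughput of a generator that takes latency_ms per asset."""
    return 1000.0 / latency_ms


def speedup(baseline_seconds: float, latency_ms: float) -> float:
    """How many times faster latency_ms is than a baseline given in seconds."""
    return baseline_seconds * 1000.0 / latency_ms


print(f"throughput: {assets_per_second(ET3D_LATENCY_MS):.0f} assets/s")

# Hypothetical baselines for per-asset optimization (assumed values).
for minutes in (5, 30):
    factor = speedup(minutes * 60.0, ET3D_LATENCY_MS)
    print(f"vs {minutes} min per asset: {factor:,.0f}x faster")
```

Under these assumed baselines, 8 ms corresponds to 125 assets per second and a speedup of roughly 37,500x to 225,000x for inference alone; the one-time cost of training the GAN is not included in this comparison.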

ET3D: Efficient Text-to-3D Generation via Multi-View Distillation

Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion

Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

BoostDream: Efficient Refining for High-Quality Text-to-3D Generation from Multi-View Diffusion

PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion

IT3D: Improved Text-to-3D Generation with Explicit View Synthesis

IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation

EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior

DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

HexaGen3D: StableDiffusion is just one step away from Fast and Diverse Text-to-3D Generation

Instant3D: Instant Text-to-3D Generation

VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation

ScaleDreamer: Scalable Text-to-3D Synthesis with Asynchronous Score Distillation

Control3D: Towards Controllable Text-to-3D Generation

3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors

Retrieval-Augmented Score Distillation for Text-to-3D Generation

Magic3D: High-Resolution Text-to-3D Content Creation

Creating High-quality 3D Content by Bridging the Gap Between Text-to-2D and Text-to-3D Generation

DreamFusion: Text-to-3D using 2D Diffusion