ET3D: Efficient Text-to-3D Generation via Multi-View Distillation

Yiming Chen, Zhiqi Li, Peidong Liu
DOI: https://doi.org/10.48550/arXiv.2311.15561
2023-11-27
Abstract: Recent breakthroughs in text-to-image generation have shown encouraging results via large generative models. Due to the scarcity of 3D assets, it is hard to transfer the success of text-to-image generation to text-to-3D generation. Existing text-to-3D generation methods usually adopt the paradigm of DreamFusion, which conducts per-asset optimization by distilling a pre-trained text-to-image diffusion model. The generation speed usually ranges from several minutes to tens of minutes per 3D asset, which degrades the user experience and also imposes a burden on service providers due to the high computational budget. In this work, we present an efficient text-to-3D generation method, which requires only around 8 ms to generate a 3D asset given the text prompt on a consumer graphics card. The main insight is that we exploit the images generated by a large pre-trained text-to-image diffusion model to supervise the training of a text-conditioned 3D generative adversarial network. Once the network is trained, we are able to efficiently generate a 3D asset via a single forward pass. Our method requires no 3D training data and provides an alternative approach for efficient text-to-3D generation by distilling pre-trained image diffusion models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the inefficiency and high compute cost of existing text-to-3D generation methods. Current methods usually follow the DreamFusion paradigm: they optimize each 3D asset separately by distilling a pre-trained text-to-image diffusion model. Generation therefore takes from several minutes to tens of minutes per asset, which degrades the user experience and imposes a high computing cost on service providers. To address this, the paper proposes ET3D, an efficient text-to-3D generation method. Its core idea is to use images generated by a large pre-trained text-to-image diffusion model to supervise the training of a text-conditioned 3D generative adversarial network (GAN). Once the network is trained, a 3D asset can be generated with a single forward pass in about 8 milliseconds. The method requires no 3D training data and offers a new, efficient route to text-to-3D generation by distilling the knowledge of a pre-trained image diffusion model.
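To put the latency figures side by side, the short calculation below converts them into throughput and a speedup ratio. The 8 ms figure is the one reported for ET3D; the 5-minute and 30-minute baselines are assumed values chosen only to illustrate "several minutes to tens of minutes", not measurements from the paper, and the function names are hypothetical.

```python
# Latency reported for ET3D in the abstract, in milliseconds.
ET3D_LATENCY_MS = 8.0

# ASSUMPTION: illustrative per-asset optimization times in minutes.
# These are not measured values; they only bracket the stated range.
ASSUMED_BASELINE_MINUTES = (5.0, 30.0)


def assets_per_second(latency_ms: float) -> float:
    """Number of assets generated per second at a given per-asset latency."""
    return 1000.0 / latency_ms


def speedup(baseline_minutes: float, fast_latency_ms: float = ET3D_LATENCY_MS) -> float:
    """Ratio of a baseline's per-asset time to the fast method's per-asset time."""
    baseline_ms = baseline_minutes * 60.0 * 1000.0
    return baseline_ms / fast_latency_ms


throughput = assets_per_second(ET3D_LATENCY_MS)
speedups = [speedup(minutes) for minutes in ASSUMED_BASELINE_MINUTES]
```

Under these assumed baselines, 8 ms per asset corresponds to 125 assets per second, and the per-asset speedup falls between 37,500 and 225,000 times. The comparison covers inference only and ignores the one-time cost of training the GAN.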
The main contributions of the paper include:
- Proposing a simple and effective text-conditioned 3D generative adversarial network;
- Training the network by distilling multi-view knowledge from a pre-trained large-scale text-to-multi-view-image generation model, without using SDS loss or any 3D datasets;
- After training, generating 3D assets on consumer-grade graphics cards in only 8 milliseconds, significantly reducing the computing budget and providing a real-time experience;
- Demonstrating that an efficient general-purpose text-to-3D generation model can be trained by relying on a pre-trained large-scale text-to-multi-view-image diffusion model;
- Emphasizing the importance of exploring efficient text-to-3D content generation by leveraging pre-trained text-to-multi-view base models.
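The adversarial supervision described above can be sketched as a pair of loss functions. This summary does not state the exact objective used in ET3D, so the non-saturating logistic GAN loss below is an assumption (a common choice for GANs), and all function names are hypothetical. The logits stand for discriminator scores on text-conditioned images: "real" scores come from multi-view images produced by the teacher diffusion model, and "fake" scores come from views rendered from the 3D generator's output.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def discriminator_loss(real_logits: Sequence[float],
                       fake_logits: Sequence[float]) -> float:
    """Logistic discriminator loss (ASSUMED objective, not confirmed by the source).

    real_logits: scores for teacher images generated by the diffusion model.
    fake_logits: scores for views rendered from the 3D generator.
    The loss is low when real scores are high and fake scores are low.
    """
    real_term = sum(softplus(-r) for r in real_logits) / len(real_logits)
    fake_term = sum(softplus(f) for f in fake_logits) / len(fake_logits)
    return real_term + fake_term


def generator_loss(fake_logits: Sequence[float]) -> float:
    """Non-saturating generator loss: low when rendered views score as real."""
    return sum(softplus(-f) for f in fake_logits) / len(fake_logits)
```

In a full training loop the two losses would be minimized in alternation, with gradients flowing from the discriminator through a differentiable renderer into the 3D generator. At inference time only the generator is needed, which is what makes single-pass generation possible.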