3DGen: Triplane Latent Diffusion for Textured Mesh Generation

Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, Barlas Oğuz
DOI: https://doi.org/10.48550/arXiv.2303.05371
2023-03-28
Abstract: Latent diffusion models for image generation have crossed a quality threshold which enabled them to achieve mass adoption. Recently, a series of works have made advancements towards replicating this success in the 3D domain, introducing techniques such as point cloud VAE, triplane representation, neural implicit surfaces and differentiable rendering based training. We take another step along this direction, combining these developments in a two-step pipeline consisting of 1) a triplane VAE which can learn latent representations of textured meshes and 2) a conditional diffusion model which generates the triplane features. For the first time this architecture allows conditional and unconditional generation of high quality textured or untextured 3D meshes across multiple diverse categories in a few seconds on a single GPU. It outperforms previous work substantially on image-conditioned and unconditional generation on mesh quality as well as texture generation. Furthermore, we demonstrate the scalability of our model to large datasets for increased quality and diversity. We will release our code and trained models.
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper "3DGen: Triplane Latent Diffusion for Textured Mesh Generation" addresses high-quality textured 3D mesh generation, in particular the following aspects:

1. **High-quality 3D mesh and texture generation**: Existing 3D generation models have deficiencies in both geometry and texture generation, and perform especially poorly on multi-class data. The paper proposes an architecture that generates high-quality 3D meshes with or without textures and performs well across multiple classes.
2. **Conditional and unconditional generation**: Existing 3D generation models have limited effectiveness in conditional generation (such as image- or text-based generation) and in unconditional generation. The proposed model generates high-quality 3D meshes in a few seconds on a single GPU and supports image- and text-conditional generation as well as unconditional generation.
3. **Scalability**: Existing 3D generation models can usually handle only small datasets or objects of specific classes. By introducing large-scale pre-training, the paper shows that the model benefits from larger datasets, with improved generation quality and diversity.
4. **Joint geometric and color representation**: How to represent geometry and color information jointly and effectively is an open problem. The paper uses the triplane representation, combined with a variational auto-encoder (VAE) and a diffusion model, to learn and generate geometry and color together efficiently.

### Method overview

To achieve these goals, the paper proposes a two-stage generation framework:

1. **First stage: Triplane VAE**
   - Use a variational auto-encoder (VAE) to encode the input colored point cloud into a triplane latent space and decode a continuous textured mesh from it.
   - Train with a rendering loss, which avoids the complex pre-processing steps (such as making meshes watertight) that traditional methods need to obtain SDF or occupancy values.
2. **Second stage: Conditional diffusion model**
   - Use a diffusion model to generate the triplane features, optionally conditioned on input image-text embeddings.
   - The diffusion model learns to generate high-quality triplane features through the reverse denoising process; the features are then decoded into a textured 3D mesh.

### Experimental results

Experiments show that 3DGen outperforms existing methods on multiple benchmark datasets, specifically:

- **Unconditional geometric generation**: The FID score improves by 23% compared to the closest competitor, NFD.
- **Text-conditional geometric generation**: The improvement is 15-20% across classes compared to the current best model, 3DILG.
- **Unconditional textured mesh generation**: The FID score improves by 70% compared to GET3D.

The paper also demonstrates the scalability of the model on large-scale datasets, which further improves generation quality and diversity.

### Summary

3DGen addresses several challenges in high-quality textured 3D mesh generation by combining a triplane VAE trained with a rendering loss and a conditional latent diffusion model, which makes it a practical tool for fast 3D object generation.
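
The triplane representation mentioned in the method overview can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `bilinear_sample` and `query_triplane`, the nested-list plane layout, the [-1, 1] coordinate range, and aggregation by summation are all assumptions chosen for illustration. The sketch only shows the general idea that a feature for a 3D point is obtained by projecting the point onto three axis-aligned feature planes and combining the sampled values.

```python
from typing import Dict, List, Sequence

# A plane is indexed as plane[row][col][channel] (hypothetical layout).
Plane = List[List[List[float]]]


def bilinear_sample(plane: Plane, u: float, v: float) -> List[float]:
    """Bilinearly interpolate a feature plane at (u, v), both in [-1, 1].

    u selects the column and v selects the row.
    """
    height, width = len(plane), len(plane[0])
    # Map [-1, 1] to continuous pixel coordinates and clamp to the grid.
    x = min(max((u + 1.0) * 0.5 * (width - 1), 0.0), float(width - 1))
    y = min(max((v + 1.0) * 0.5 * (height - 1), 0.0), float(height - 1))
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    tx, ty = x - x0, y - y0
    features = []
    for c in range(len(plane[0][0])):
        top = plane[y0][x0][c] * (1.0 - tx) + plane[y0][x1][c] * tx
        bottom = plane[y1][x0][c] * (1.0 - tx) + plane[y1][x1][c] * tx
        features.append(top * (1.0 - ty) + bottom * ty)
    return features


def query_triplane(planes: Dict[str, Plane], point: Sequence[float]) -> List[float]:
    """Return the feature of a 3D point by sampling the xy, xz and yz planes.

    Summation is an assumed aggregation; concatenation is another common choice.
    """
    x, y, z = point
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xz = bilinear_sample(planes["xz"], x, z)
    f_yz = bilinear_sample(planes["yz"], y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


if __name__ == "__main__":
    # Tiny 2x2 planes with a single feature channel (made-up values).
    demo_planes = {
        "xy": [[[0.0], [1.0]], [[2.0], [3.0]]],
        "xz": [[[10.0], [10.0]], [[10.0], [10.0]]],
        "yz": [[[100.0], [100.0]], [[100.0], [100.0]]],
    }
    print(query_triplane(demo_planes, (0.0, 0.0, 0.0)))
```

In a full pipeline, a small decoder network would typically map the returned feature vector to quantities such as a signed distance or a color; that step is omitted here because the summary does not describe it.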