Abstract: Latent diffusion models for image generation have crossed a quality threshold which enabled them to achieve mass adoption. Recently, a series of works have made advancements towards replicating this success in the 3D domain, introducing techniques such as point cloud VAE, triplane representation, neural implicit surfaces and differentiable rendering based training. We take another step along this direction, combining these developments in a two-step pipeline consisting of 1) a triplane VAE which can learn latent representations of textured meshes and 2) a conditional diffusion model which generates the triplane features. For the first time this architecture allows conditional and unconditional generation of high quality textured or untextured 3D meshes across multiple diverse categories in a few seconds on a single GPU. It outperforms previous work substantially on image-conditioned and unconditional generation on mesh quality as well as texture generation. Furthermore, we demonstrate the scalability of our model to large datasets for increased quality and diversity. We will release our code and trained models.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "3DGen: Triplane Latent Diffusion for Textured Mesh Generation" addresses high-quality generation of textured 3D meshes, focusing on the following aspects:

1. **High-quality 3D mesh and texture generation**: Existing 3D generative models fall short in both geometry and texture quality, and perform especially poorly on multi-category data. The paper proposes an architecture that generates high-quality 3D meshes, with or without textures, across multiple diverse categories.
2. **Conditional and unconditional generation**: Existing 3D generative models have limited effectiveness in conditional generation (for example, image- or text-based) and in unconditional generation. The proposed model generates high-quality 3D meshes in a few seconds on a single GPU and supports image-conditional, text-conditional, and unconditional generation.
3. **Scalability**: Existing 3D generative models can usually handle only small datasets or objects from specific categories. By training on larger datasets, the paper shows that generation quality and diversity improve as the data scales.
4. **Joint geometry and color representation**: How to represent geometry and color jointly and efficiently is an open problem. The paper combines the triplane representation with a variational auto-encoder (VAE) and a diffusion model to learn and generate geometry and color together.

### Method overview

To achieve these goals, the paper proposes a two-stage generation framework:

1. **First stage: Triplane VAE**
   - A variational auto-encoder (VAE) encodes the input colored point cloud into a triplane latent space and decodes a continuous textured mesh from it.
   - Training uses a rendering loss, which avoids the complex pre-processing steps (such as making meshes watertight) that traditional methods need in order to obtain SDF or occupancy values.
2. **Second stage: Conditional diffusion model**
   - A diffusion model generates the triplane features, optionally conditioned on image or text embeddings.
   - The diffusion model learns to produce triplane features through the reverse denoising process; the VAE decoder then turns these features into a textured 3D mesh.

### Experimental results

According to the summary, 3DGen outperforms existing methods on multiple benchmark datasets:

- **Unconditional geometry generation**: The FID score improves by 23% over NFD, the closest competitor.
- **Image-conditioned geometry generation**: Results improve by 15-20%, depending on the category, over 3DILG, the previous best model.
- **Unconditional textured mesh generation**: The FID score improves by 70% over GET3D.

The paper also demonstrates that the model scales to large datasets, which further improves generation quality and diversity.

### Summary

3DGen combines a triplane VAE trained with a rendering loss and a conditional latent diffusion model to generate high-quality textured 3D meshes, conditionally or unconditionally, across diverse categories in a few seconds on a single GPU.
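The triplane representation used in both stages can be illustrated with a minimal sketch of a feature lookup: a 3D point is projected onto three axis-aligned feature planes, each plane is sampled bilinearly, and the three results are aggregated. Everything here is an assumption for illustration rather than the paper's implementation: the names `bilinear` and `sample_triplane`, the `[-1, 1]` coordinate convention, the nested-list plane layout, and aggregation by summation (concatenation is another common choice). In a full model, a small decoder network would map the aggregated feature to geometry and color values; that part is omitted.

```python
import math


def bilinear(plane, u, v):
    """Bilinearly sample a square feature plane at (u, v) in [-1, 1].

    `plane[a][b]` is a list of channel values at grid cell (a, b);
    the plane must have resolution >= 2 along each axis.
    """
    res = len(plane)
    # Map [-1, 1] onto continuous grid coordinates [0, res - 1].
    a = (u + 1.0) * 0.5 * (res - 1)
    b = (v + 1.0) * 0.5 * (res - 1)
    # Lower corner of the enclosing cell, clamped so a0 + 1 stays in range.
    a0 = min(max(int(math.floor(a)), 0), res - 2)
    b0 = min(max(int(math.floor(b)), 0), res - 2)
    ta, tb = a - a0, b - b0
    channels = len(plane[0][0])
    return [
        (1 - ta) * (1 - tb) * plane[a0][b0][c]
        + ta * (1 - tb) * plane[a0 + 1][b0][c]
        + (1 - ta) * tb * plane[a0][b0 + 1][c]
        + ta * tb * plane[a0 + 1][b0 + 1][c]
        for c in range(channels)
    ]


def sample_triplane(planes, point):
    """Look up the feature of a 3D point from three axis-aligned planes.

    `planes` is a dict with keys "xy", "xz", "yz" (hypothetical layout).
    Features from the three planes are summed (an assumed aggregation).
    """
    x, y, z = point
    f_xy = bilinear(planes["xy"], x, y)
    f_xz = bilinear(planes["xz"], x, z)
    f_yz = bilinear(planes["yz"], y, z)
    return [p + q + r for p, q, r in zip(f_xy, f_xz, f_yz)]


if __name__ == "__main__":
    res = 4
    # Toy single-channel planes whose values depend on the grid index.
    toy = {
        key: [[[float(a + b)] for b in range(res)] for a in range(res)]
        for key in ("xy", "xz", "yz")
    }
    print(sample_triplane(toy, (0.25, -0.5, 0.9)))
```

Because three 2D planes stand in for a dense 3D grid, storage grows quadratically rather than cubically with resolution, and the planes can be treated as image-like tensors, which is what makes them a convenient target for a 2D-style diffusion model.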
