Structured 3D Latents for Scalable and Versatile 3D Generation

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, Jiaolong Yang

2024-12-02

Abstract: We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
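
The abstract names rectified flow as the generative backbone but does not spell out the objective. As a rough illustration only, the sketch below shows the standard rectified-flow training pair: a straight-line interpolation between a data sample and Gaussian noise, with the constant velocity along that line as the regression target. The function names, the NumPy implementation, and the `predict_velocity` callable are assumptions made for this example; they are not taken from the paper's code, and the paper's actual time sampling and conditioning may differ.

```python
import numpy as np


def rectified_flow_pair(x0, noise, t):
    """Build one training pair for a rectified-flow model.

    x0    : clean data sample (any array shape)
    noise : Gaussian noise with the same shape as x0
    t     : scalar in [0, 1]; t=0 gives the data, t=1 gives pure noise
    """
    x_t = (1.0 - t) * x0 + t * noise   # point on the straight path
    v_target = noise - x0              # constant velocity along that path
    return x_t, v_target


def flow_matching_loss(predict_velocity, x0, rng):
    """Mean squared error between predicted and target velocity.

    predict_velocity is a hypothetical stand-in for the network:
    any callable taking (x_t, t) and returning an array shaped like x_t.
    """
    noise = rng.standard_normal(x0.shape)
    t = rng.uniform(0.0, 1.0)          # uniform t is an assumption here
    x_t, v_target = rectified_flow_pair(x0, noise, t)
    v_pred = predict_velocity(x_t, t)
    return float(np.mean((v_pred - v_target) ** 2))
```

At sampling time a model trained this way is integrated from noise (t=1) back to data (t=0) by following the negative of the predicted velocity; that step is omitted here to keep the sketch short.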

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses two gaps in 3D generation. First, existing 3D generative models still fall far short of 2D image generation models in output quality. Second, there is no unified framework that supports high-quality generation across multiple 3D representation formats (such as Radiance Fields, 3D Gaussians, and meshes). Specifically, existing methods either model geometry poorly, lose appearance detail, or cannot adapt flexibly to different downstream requirements. These models also usually require complex pre-processing steps, such as aligning 3D data with a specific representation, which increases training cost and complexity. To meet these challenges, the paper proposes a new 3D generation method built on a unified Structured LATent (SLAT) representation, which can be decoded into different 3D representation formats while preserving high-quality geometric and appearance information. With this method, the authors aim to develop an efficient, high-quality, multi-purpose 3D generation model that supports different application scenarios without special adaptation or pre-processing for each specific 3D representation.
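
To make the idea of a sparsely populated grid carrying per-voxel latents more concrete, here is a minimal sketch of such a container. It is an illustration under stated assumptions, not the paper's data structure: the class name `SparseLatent`, its fields, and the grid resolution and channel count used by a caller are all hypothetical.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SparseLatent:
    """Hypothetical container: latent vectors stored only at active voxels."""

    coords: np.ndarray   # (N, 3) integer indices of active voxels
    feats: np.ndarray    # (N, C) one latent vector per active voxel
    resolution: int      # side length of the cubic grid (caller's choice)

    def __post_init__(self):
        if self.coords.shape[0] != self.feats.shape[0]:
            raise ValueError("coords and feats must have the same length")
        if self.coords.size and (
            self.coords.min() < 0 or self.coords.max() >= self.resolution
        ):
            raise ValueError("voxel index outside the grid")

    def occupancy(self) -> float:
        """Fraction of grid cells that carry a latent."""
        return self.coords.shape[0] / self.resolution ** 3

    def to_dense(self) -> np.ndarray:
        """Scatter the latents into a dense (R, R, R, C) array of zeros."""
        r, c = self.resolution, self.feats.shape[1]
        dense = np.zeros((r, r, r, c), dtype=self.feats.dtype)
        x, y, z = self.coords[:, 0], self.coords[:, 1], self.coords[:, 2]
        dense[x, y, z] = self.feats
        return dense
```

Keeping only the active voxels means storage grows with the number of occupied cells rather than with the cube of the resolution. A format-specific decoder would then consume the same `coords` and `feats` and emit Gaussians, a radiance field, or a mesh; those decoders are not sketched here.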

3D-Aware Image Synthesis Via Learning Structural and Textural Representations

StructLDM: Structured Latent Diffusion for 3D Human Generation

Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer

LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation

Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion

AutoDecoding Latent 3D Diffusion Models

LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis

Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior

CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets

LDM: Large Tensorial SDF Model for Textured Mesh Generation

3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors

GET3D: A Generative Model of High Quality 3D Textured Shapes Learned from Images

GVGEN: Text-to-3D Generation with Volumetric Representation

PlacidDreamer: Advancing Harmony in Text-to-3D Generation

DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation

ScalingGaussian: Enhancing 3D Content Creation with Generative Gaussian Splatting

DreamScape: 3D Scene Creation via Gaussian Splatting joint Correlation Modeling

LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation