Abstract: We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets (represented by Neural Radiance Fields) from text prompts. Unlike recent 3D generative models that rely on clean and well-aligned 3D data, limiting them to single or few-class generation, our model is directly trained on extensive noisy and unaligned "in-the-wild" 3D assets, mitigating the key challenge (i.e., data scarcity) in large-scale 3D generation. In particular, DIRECT-3D is a tri-plane diffusion model that integrates two innovations: 1) A novel learning framework where noisy data are filtered and aligned automatically during the training process. Specifically, after an initial warm-up phase using a small set of clean data, an iterative optimization is introduced in the diffusion process to explicitly estimate the 3D pose of objects and select beneficial data based on conditional density. 2) An efficient 3D representation that is achieved by disentangling object geometry and color features with two separate conditional diffusion models that are optimized hierarchically. Given a prompt input, our model generates high-quality, high-resolution, realistic, and complex 3D objects with accurate geometric details in seconds. We achieve state-of-the-art performance in both single-class generation and text-to-3D generation. We also demonstrate that DIRECT-3D can serve as a useful 3D geometric prior of objects, for example to alleviate the well-known Janus problem in 2D-lifting methods such as DreamFusion. The code and models are available for research purposes at: https://github.com/qihao067/direct3d
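For readers unfamiliar with the tri-plane representation mentioned in the abstract, the following is a minimal sketch of how a feature vector for a 3D point can be read from three axis-aligned feature planes. It is not taken from the DIRECT-3D code: the function names `sample_plane` and `query_triplane`, the plane keys, and the choice to sum the three plane features (one common aggregation) are assumptions made for illustration.

```python
import numpy as np


def sample_plane(plane, u, v):
    """Bilinearly sample a (C, H, W) feature plane at (u, v), both in [-1, 1]."""
    _, height, width = plane.shape
    # Map normalized coordinates to continuous pixel coordinates.
    x = (u + 1.0) * 0.5 * (width - 1)
    y = (v + 1.0) * 0.5 * (height - 1)
    # Top-left corner of the interpolation cell, clamped so x0 + 1 and y0 + 1 stay in range.
    x0 = int(np.clip(np.floor(x), 0, width - 2))
    y0 = int(np.clip(np.floor(y), 0, height - 2))
    tx, ty = x - x0, y - y0
    top = (1 - tx) * plane[:, y0, x0] + tx * plane[:, y0, x0 + 1]
    bottom = (1 - tx) * plane[:, y0 + 1, x0] + tx * plane[:, y0 + 1, x0 + 1]
    return (1 - ty) * top + ty * bottom


def query_triplane(planes, point):
    """Project a 3D point onto the xy, xz, and yz planes and sum the sampled features.

    `planes` is a dict with keys 'xy', 'xz', 'yz', each a (C, H, W) array (assumed layout).
    """
    x, y, z = point
    return (
        sample_plane(planes["xy"], x, y)
        + sample_plane(planes["xz"], x, z)
        + sample_plane(planes["yz"], y, z)
    )
```

In a disentangled setup like the one the abstract describes, one would presumably keep one set of planes for geometry and another for color and query each separately; the sketch above covers only the lookup step.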

What problem does this paper attempt to address?

The main problem this paper addresses is the difficulty of training large-scale 3D generative models when 3D data are scarce and uneven in quality. Existing 3D generative models usually rely on clean, well-aligned 3D datasets such as ShapeNet, which limits them to a single category or a few categories and restricts the diversity of the generated objects. These models also perform poorly on large-scale, noisy, and unaligned "in-the-wild" 3D data.

To overcome these problems, the paper proposes DIRECT-3D, a diffusion-based 3D generative model that can be trained directly, end-to-end, on large amounts of noisy and unaligned in-the-wild 3D data. An iterative optimization process lets DIRECT-3D filter and align the data automatically, so that these large-scale 3D datasets can be used effectively. Decoupling geometry and color features further improves the quality and efficiency of 3D object generation. The specific technical innovations are:

1. **Iterative optimization process**: The pose of each 3D object is estimated explicitly and useful data are selected during the diffusion process, which achieves automatic data cleaning and alignment.
2. **Decoupling geometric and color features**: Two separate conditional diffusion models generate geometry and color features respectively, which improves the efficiency and generative capacity of the model.
3. **Automatically generating descriptive captions**: Descriptive captions are generated at different granularities, which strengthens the alignment between text prompts and generated 3D objects.

With these innovations, DIRECT-3D performs well in single-category generation and also achieves state-of-the-art performance in large-scale text-to-3D generation. DIRECT-3D can also serve as a 3D geometric prior, improving geometric consistency in existing 2D-lifting methods.
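The pose estimation and data selection step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's procedure: it assumes a small discrete set of candidate poses, uses the average noise-prediction error of a pose-conditional denoiser as a proxy for conditional density, and replaces the trained model with a toy denoiser. All names (`make_toy_denoiser`, `denoising_loss`, `align_and_filter`, `keep_threshold`) are hypothetical.

```python
import numpy as np


def make_toy_denoiser(template):
    """Hypothetical stand-in for a trained pose-conditional denoiser.

    It believes every clean object is `template` rotated by `pose` quarter
    turns, and predicts the noise that would explain x_t under that belief.
    """
    def denoiser(x_t, t, pose):
        believed_x0 = np.rot90(template, pose)
        return (x_t - np.sqrt(1.0 - t) * believed_x0) / np.sqrt(t)
    return denoiser


def denoising_loss(denoiser, x0, pose, rng, n_draws=8):
    """Monte Carlo estimate of the noise-prediction error for x0 under one candidate pose."""
    losses = []
    for _ in range(n_draws):
        t = rng.uniform(0.1, 0.9)                         # noise level in (0, 1)
        eps = rng.standard_normal(x0.shape)
        x_t = np.sqrt(1.0 - t) * x0 + np.sqrt(t) * eps    # variance-preserving noising
        losses.append(np.mean((denoiser(x_t, t, pose) - eps) ** 2))
    return float(np.mean(losses))


def align_and_filter(denoiser, samples, candidate_poses, keep_threshold, seed=0):
    """For each sample, pick the lowest-loss pose and keep the sample only if that loss is small."""
    rng = np.random.default_rng(seed)
    kept = []
    for x0 in samples:
        scores = [denoising_loss(denoiser, x0, pose, rng) for pose in candidate_poses]
        best = int(np.argmin(scores))
        if scores[best] <= keep_threshold:                # low loss ~ high conditional density (assumed proxy)
            kept.append((x0, candidate_poses[best], scores[best]))
    return kept
```

As a usage example, calling `align_and_filter` with a toy denoiser built from an asymmetric 4x4 template, samples that are rotated copies of that template plus one unrelated array, and `candidate_poses=[0, 1, 2, 3]` should recover the rotation of each copy and discard the unrelated array. In a real system the denoiser would be the diffusion model after its warm-up phase on clean data, and the pose search would likely be continuous rather than a four-way choice.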

DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion

Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer

Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion

VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation

3DDesigner: Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion Models

EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Prior

ET3D: Efficient Text-to-3D Generation via Multi-View Distillation

3DTopia: Large Text-to-3D Generation Model with Hybrid Diffusion Priors

Magic3D: High-Resolution Text-to-3D Content Creation

Enhanced 3D Generation by 2D Editing

Creating High-quality 3D Content by Bridging the Gap Between Text-to-2D and Text-to-3D Generation

HexaGen3D: StableDiffusion is just one step away from Fast and Diverse Text-to-3D Generation

GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models

Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation

PlacidDreamer: Advancing Harmony in Text-to-3D Generation

DreamFusion: Text-to-3D using 2D Diffusion

IT3D: Improved Text-to-3D Generation with Explicit View Synthesis

GeoDream: Disentangling 2D and Geometric Priors for High-Fidelity and Consistent 3D Generation