Abstract: Text-to-3D is an emerging task that allows users to create 3D content with infinite possibilities. Existing works tackle the problem by optimizing a 3D representation with guidance from pre-trained diffusion models. An apparent drawback is that they need to optimize from scratch for each prompt, which is computationally expensive and often yields poor visual fidelity. In this paper, we propose DreamPortrait, which aims to generate text-guided 3D-aware portraits in a single forward pass for efficiency. To achieve this, we extend Score Distillation Sampling from a datapoint formulation to a distribution formulation, which injects a semantic prior into a 3D distribution. However, the direct extension leads to mode collapse, since the objective only pursues semantic alignment. Hence, we propose to optimize a distribution with hierarchical condition adapters and GAN loss regularization. For better 3D modeling, we further design a 3D-aware gated cross-attention mechanism that explicitly lets the model perceive the correspondence between the text and the 3D-aware space. These designs enable our model to generate portraits with robust multi-view semantic consistency, removing the need for per-prompt optimization. Extensive experiments demonstrate our model's highly competitive performance and a significant speedup over existing methods.

What problem does this paper attempt to address?

The paper addresses text-to-3D face generation: how to efficiently generate high-quality, semantically consistent 3D portraits from textual descriptions while avoiding the high computational cost and multi-view semantic inconsistency of existing methods. It proposes a framework named DreamPortrait, whose core contributions are:

1. **Efficient Generation Mechanism**: Unlike existing methods that require a separate optimization for each input text, DreamPortrait generates text-guided 3D portraits in a single forward pass, significantly improving efficiency. This is mainly achieved by extending Score Distillation Sampling (SDS) from the datapoint level to the distribution level.
2. **Preventing Mode Collapse**: Applying SDS directly to a distribution may lead to mode collapse, where the generated results are overly uniform. The paper addresses this with two strategies: optimizing the distribution with hierarchical condition adapters, and introducing GAN loss regularization to maintain 3D priors.
3. **3D-Aware Gated Attention Mechanism**: To better capture the correspondence between text and 3D space, the paper designs a 3D-aware gated cross-attention mechanism that lets the model relate the input text to the 3D-aware space more effectively.
4. **Experimental Validation**: Through quantitative and qualitative experiments, the paper reports superior performance for DreamPortrait on multiple datasets, particularly in image quality and semantic consistency, along with advantages in generation speed and memory usage.

In summary, the paper proposes an efficient, high-quality text-to-3D face generation method that targets two weaknesses of existing techniques: low computational efficiency and poor multi-view semantic consistency.
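
The mode-collapse point in item 2 can be illustrated with a toy one-dimensional sketch. This is not the paper's implementation. It assumes a Gaussian prior N(MU, 1) standing in for the text-conditioned diffusion model (so the noise prediction is available in closed form), a weighting w(t) = 1, and a hypothetical linear "generator" `x = bias + scale * z` standing in for the 3D generator. All names (`MU`, `eps_hat`, `sds_grad`, `optimize_datapoint`, `optimize_distribution`) are illustrative. The SDS gradient used is the standard form, `(eps_hat(x_t, t) - eps) * dx/dtheta`.

```python
import math
import random

MU = 2.0  # mean of the toy prior N(MU, 1); an assumed stand-in for the diffusion prior


def eps_hat(x_t, alpha, sigma):
    """Exact noise prediction E[eps | x_t] for data ~ N(MU, 1), alpha^2 + sigma^2 = 1."""
    return sigma * (x_t - alpha * MU)


def sds_grad(x, rng):
    """Single-sample SDS gradient with respect to x, using w(t) = 1."""
    t = rng.uniform(0.02, 0.98)
    alpha, sigma = math.sqrt(1.0 - t), math.sqrt(t)
    eps = rng.gauss(0.0, 1.0)
    x_t = alpha * x + sigma * eps  # forward-diffused sample
    return eps_hat(x_t, alpha, sigma) - eps


def optimize_datapoint(x=-1.0, steps=1000, lr=0.05, batch=64, seed=0):
    """Datapoint-level SDS: optimize one sample x directly."""
    rng = random.Random(seed)
    for _ in range(steps):
        grad = sum(sds_grad(x, rng) for _ in range(batch)) / batch
        x -= lr * grad
    return x


def optimize_distribution(bias=-1.0, scale=1.0, steps=1000, lr=0.05, batch=64, seed=0):
    """Distribution-level SDS: optimize generator parameters over latents z."""
    rng = random.Random(seed)
    for _ in range(steps):
        g_bias = g_scale = 0.0
        for _ in range(batch):
            z = rng.gauss(0.0, 1.0)
            g = sds_grad(bias + scale * z, rng)  # gradient w.r.t. the generated sample
            g_bias += g        # chain rule: dx/dbias = 1
            g_scale += g * z   # chain rule: dx/dscale = z
        bias -= lr * g_bias / batch
        scale -= lr * g_scale / batch
    return bias, scale


if __name__ == "__main__":
    print("datapoint optimum:", optimize_datapoint())
    print("generator (bias, scale):", optimize_distribution())
```

In this toy setting the datapoint converges toward the prior mode, and the generator's `bias` does the same while its `scale` shrinks toward zero, so every latent `z` maps to nearly the same output. That is the collapse the summary describes, and it suggests why an additional term such as the GAN loss regularization mentioned above is needed to keep the generated distribution diverse.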

Efficient Text-Guided 3D-Aware Portrait Generation with Score Distillation Sampling on Distribution

Portrait3D: Text-Guided High-Quality 3D Portrait Generation Using Pyramid Representation and GANs Prior

DreamTime: An Improved Optimization Strategy for Diffusion-Guided 3D Generation

VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation

ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation

Freestyle 3D-Aware Portrait Synthesis Based on Compositional Generative Priors

Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

AnimatableDreamer: Text-Guided Non-rigid 3D Model Generation and Reconstruction with Canonical Score Distillation

JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation

ET3D: Efficient Text-to-3D Generation via Multi-View Distillation

BoostDream: Efficient Refining for High-Quality Text-to-3D Generation from Multi-View Diffusion

PlacidDreamer: Advancing Harmony in Text-to-3D Generation

Connecting Consistency Distillation to Score Distillation for Text-to-3D Generation

Retrieval-Augmented Score Distillation for Text-to-3D Generation

LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching

Text-to-3D with Classifier Score Distillation

Dream-in-Style: Text-to-3D Generation using Stylized Score Distillation

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

ModeDreamer: Mode Guiding Score Distillation for Text-to-3D Generation using Reference Image Prompts