Abstract: Recently, text-guided 3D generative methods have made remarkable advancements in producing high-quality textures and geometry, capitalizing on the proliferation of large vision-language and image diffusion models. However, existing methods still struggle to create high-fidelity 3D head avatars in two aspects: (1) They rely mostly on a pre-trained text-to-image diffusion model whilst missing the necessary 3D awareness and head priors. This makes them prone to inconsistency and geometric distortions in the generated avatars. (2) They fall short in fine-grained editing. This is primarily due to the inherited limitations from the pre-trained 2D image diffusion models, which become more pronounced when it comes to 3D head avatars. In this work, we address these challenges by introducing a versatile coarse-to-fine pipeline dubbed HeadSculpt for crafting (i.e., generating and editing) 3D head avatars from textual prompts. Specifically, we first equip the diffusion model with 3D awareness by leveraging landmark-based control and a learned textual embedding representing the back view appearance of heads, enabling 3D-consistent head avatar generations. We further propose a novel identity-aware editing score distillation strategy to optimize a textured mesh with a high-resolution differentiable rendering technique. This enables identity preservation while following the editing instruction. We showcase HeadSculpt's superior fidelity and editing capabilities through comprehensive experiments and comparisons with existing methods.

What problem does this paper attempt to address?

The paper addresses two main problems:

1. **Inconsistency and geometric distortion in high-fidelity 3D head avatars**: Existing text-guided 3D generation methods rely mostly on pre-trained text-to-image diffusion models, which lack 3D awareness and head priors. As a result, the generated avatars look inconsistent across viewpoints and are prone to geometric distortions.
2. **Lack of fine-grained editing**: Current methods inherit the limitations of pre-trained 2D image diffusion models. These limitations become more pronounced for 3D head avatars, where an edit can lose the subject's identity or fail to follow the instruction.

To address these problems, the paper proposes HeadSculpt, a coarse-to-fine pipeline for generating and editing 3D head avatars from text prompts. HeadSculpt introduces two components:

- **Prior-driven Score Distillation**: A landmark-based ControlNet and a learned textual embedding that represents the back-view appearance of heads give the diffusion model 3D awareness, which enables 3D-consistent head avatar generation.
- **Identity-aware Editing Score Distillation (IESD)**: A new editing strategy that optimizes a textured mesh with high-resolution differentiable rendering, preserving identity while following the editing instruction.

The paper reports that, with these components, HeadSculpt generates high-quality 3D head avatars and supports fine-grained edits. It handles several kinds of heads, including ordinary humans, celebrities, and non-human characters, and it can apply local modifications, shape or texture adjustments, and style changes from simple descriptions or instructions.
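Both components build on score distillation, so a minimal sketch of the generic score distillation update may help. It is not HeadSculpt's implementation: the landmark-based ControlNet, the back-view embedding, the identity-aware editing term, and the differentiable renderer are all left out. The names `add_noise`, `make_toy_predictor`, `sds_gradient`, and `optimize` are hypothetical, the vector `x` stands in for a rendered image, and the noise predictor is a toy closed-form function rather than a pre-trained diffusion model.

```python
import random


def add_noise(x, eps, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * eps."""
    a = alpha_bar ** 0.5
    s = (1.0 - alpha_bar) ** 0.5
    return [a * xi + s * ei for xi, ei in zip(x, eps)]


def make_toy_predictor(target):
    """Hypothetical stand-in for a pre-trained diffusion model.

    This is the exact noise predictor for a data distribution that is a
    single point at `target`; no network or text prompt is involved.
    """
    def predict_noise(x_t, alpha_bar):
        a = alpha_bar ** 0.5
        s = (1.0 - alpha_bar) ** 0.5
        return [(xi - a * ti) / s for xi, ti in zip(x_t, target)]
    return predict_noise


def sds_gradient(x, predict_noise, alpha_bar, rng):
    """Generic score distillation gradient: w(t) * (eps_hat - eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x]   # sampled noise
    x_t = add_noise(x, eps, alpha_bar)       # noised "rendering"
    eps_hat = predict_noise(x_t, alpha_bar)  # predicted noise
    w = 1.0 - alpha_bar                      # assumed weighting w(t)
    return [w * (eh - e) for eh, e in zip(eps_hat, eps)]


def optimize(x0, target, steps=200, lr=0.5, seed=0):
    """Plain gradient descent on x using the score distillation gradient."""
    rng = random.Random(seed)
    predict_noise = make_toy_predictor(target)
    x = list(x0)
    for _ in range(steps):
        alpha_bar = rng.uniform(0.02, 0.98)  # random noise level per step
        grad = sds_gradient(x, predict_noise, alpha_bar, rng)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x


if __name__ == "__main__":
    print(optimize([0.0, 0.0, 0.0], [1.0, -2.0, 0.5]))
```

In a real text-to-3D pipeline the gradient would be propagated through a differentiable renderer to the 3D parameters instead of being applied to `x` directly, and the predictor would be conditioned on the text prompt. With the toy predictor, the update simply pulls `x` toward `target`, which makes the behavior easy to check.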

HeadSculpt: Crafting 3D Head Avatars with Text

HeadEvolver: Text to Head Avatars via Expressive and Attribute-Preserving Mesh Deformation

Articulated 3D Head Avatar Generation using Text-to-Image Diffusion Models

HeadStudio: Text to Animatable Head Avatars with 3D Gaussian Splatting

AvatarStudio: Text-driven Editing of 3D Dynamic Human Head Avatars

Guide3D: Create 3D Avatars from Text and Image Guidance

AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text

Text-Guided 3D Face Synthesis -- From Generation to Editing

X-Oscar: A Progressive Framework for High-quality Text-guided 3D Animatable Avatar Generation

SEEAvatar: Photorealistic Text-to-3D Avatar Generation with Constrained Geometry and Appearance

GANHead: Towards Generative Animatable Neural Head Avatars

DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance

OmniAvatar: Geometry-Guided Controllable 3D Head Synthesis

Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only Images

DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models

RenderMe-360: A Large Digital Asset Library and Benchmarks Towards High-fidelity Head Avatars

TimeWalker: Personalized Neural Space for Lifelong Head Avatars

GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image

En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data

DynamicAvatars: Accurate Dynamic Facial Avatars Reconstruction and Precise Editing with Diffusion Models

ID-to-3D: Expressive ID-guided 3D Heads via Score Distillation Sampling