HeadSculpt: Crafting 3D Head Avatars with Text

Xiao Han, Yukang Cao, Kai Han, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, Tao Xiang, Kwan-Yee K. Wong
2023-08-29
Abstract: Recently, text-guided 3D generative methods have made remarkable advancements in producing high-quality textures and geometry, capitalizing on the proliferation of large vision-language and image diffusion models. However, existing methods still struggle to create high-fidelity 3D head avatars in two aspects: (1) They rely mostly on a pre-trained text-to-image diffusion model whilst missing the necessary 3D awareness and head priors. This makes them prone to inconsistency and geometric distortions in the generated avatars. (2) They fall short in fine-grained editing. This is primarily due to the inherited limitations from the pre-trained 2D image diffusion models, which become more pronounced when it comes to 3D head avatars. In this work, we address these challenges by introducing a versatile coarse-to-fine pipeline dubbed HeadSculpt for crafting (i.e., generating and editing) 3D head avatars from textual prompts. Specifically, we first equip the diffusion model with 3D awareness by leveraging landmark-based control and a learned textual embedding representing the back view appearance of heads, enabling 3D-consistent head avatar generations. We further propose a novel identity-aware editing score distillation strategy to optimize a textured mesh with a high-resolution differentiable rendering technique. This enables identity preservation while following the editing instruction. We showcase HeadSculpt's superior fidelity and editing capabilities through comprehensive experiments and comparisons with existing methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two main issues:

1. **Inconsistency and geometric distortion in high-fidelity 3D head avatars**: Existing text-guided 3D generation methods rely mostly on pre-trained text-to-image diffusion models, which lack the necessary 3D awareness and head priors. As a result, the generated avatars are inconsistent across viewpoints and prone to geometric distortions.
2. **Lack of fine-grained editing capabilities**: Current methods fall short in fine-grained editing, primarily because of limitations inherited from the pre-trained 2D image diffusion models. These limitations become more pronounced for 3D head avatars, where an edit can lose the subject's identity or fail to follow the instruction.

To address these challenges, the paper proposes a coarse-to-fine pipeline called HeadSculpt for generating and editing 3D head avatars from text prompts. It rests on two components:

- **Prior-driven Score Distillation**: A landmark-based ControlNet and a learned textual embedding representing the back-view appearance of heads give the diffusion model 3D awareness, enabling 3D-consistent head avatars.
- **Identity-aware Editing Score Distillation (IESD)**: An editing strategy that optimizes a textured mesh with high-resolution differentiable rendering, preserving identity while following the editing instruction.

With these components, HeadSculpt generates high-quality 3D head avatars and supports fine-grained edits. It handles various subjects, including humans, celebrities, and non-human characters, and can perform local modifications, shape and texture adjustments, and style transformations from simple descriptions or instructions.
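
Both components named above build on score distillation, where a diffusion model's noise prediction is turned into a gradient on the rendered image. The following is a minimal toy sketch of that generic idea, not the paper's implementation. Everything in it is an assumption made for illustration: a three-element list stands in for a rendered image, `toy_noise_predictor` is a hypothetical stand-in for a text-conditioned diffusion model, the weighting `1 - alpha_bar` is just one common choice, and the renderer's Jacobian is omitted. The helper `blend_noise` and its `w_edit` parameter are likewise hypothetical; they only illustrate how mixing an "edit" prediction with an "identity" prediction could trade off the two goals, and should not be read as the IESD formula.

```python
import random

def add_noise(x, eps, alpha_bar):
    """Forward diffusion: x_t = sqrt(alpha_bar) * x + sqrt(1 - alpha_bar) * eps."""
    a, b = alpha_bar ** 0.5, (1.0 - alpha_bar) ** 0.5
    return [a * xi + b * ei for xi, ei in zip(x, eps)]

def toy_noise_predictor(target):
    """Hypothetical stand-in for a diffusion model conditioned on a prompt.
    It predicts the noise that explains x_t if the clean image were `target`."""
    def predict(x_t, alpha_bar):
        a, b = alpha_bar ** 0.5, (1.0 - alpha_bar) ** 0.5
        return [(xt - a * ti) / b for xt, ti in zip(x_t, target)]
    return predict

def blend_noise(pred_edit, pred_identity, w_edit):
    """Hypothetical mix of two noise predictions (illustration only)."""
    def predict(x_t, alpha_bar):
        e = pred_edit(x_t, alpha_bar)
        i = pred_identity(x_t, alpha_bar)
        return [w_edit * ej + (1.0 - w_edit) * ij for ej, ij in zip(e, i)]
    return predict

def sds_gradient(x, predict, alpha_bar, rng):
    """Score distillation gradient: w(t) * (predicted noise - injected noise)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    x_t = add_noise(x, eps, alpha_bar)
    eps_hat = predict(x_t, alpha_bar)
    w = 1.0 - alpha_bar  # assumed weighting
    return [w * (eh - e) for eh, e in zip(eps_hat, eps)]

def optimize(x0, predict, steps=300, lr=0.5, seed=0):
    """Plain gradient descent on the image, sampling a noise level each step."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        alpha_bar = rng.uniform(0.1, 0.9)
        g = sds_gradient(x, predict, alpha_bar, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

original = [1.0, 0.0, -1.0]  # toy "identity" image
edited = [3.0, 0.0, 1.0]     # toy "fully edited" image
mixed = blend_noise(toy_noise_predictor(edited), toy_noise_predictor(original), 0.5)
result = optimize(original, mixed)
```

In this toy setting the injected noise cancels exactly, so each step pulls the image toward whatever the predictor treats as the clean target. With the blended predictor, `result` settles at the weighted mix of `original` and `edited`, which is the intuition for balancing an edit against the original identity. A real pipeline would backpropagate the gradient through a differentiable renderer into the 3D representation rather than update pixels directly.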