Abstract: 3D face editing is a significant task in multimedia, aimed at the manipulation of 3D face models across various control signals. The success of 3D-aware GANs provides expressive 3D models learned from 2D single-view images only, encouraging researchers to discover semantic editing directions in their latent space. However, previous methods face challenges in balancing quality, efficiency, and generalization. To solve the problem, we explore the possibility of introducing the strengths of diffusion models into 3D-aware GANs. In this paper, we present Face Clan, a fast and text-general approach for generating and manipulating 3D faces based on arbitrary attribute descriptions. To achieve disentangled editing, we propose to diffuse on the latent space under a pair of opposite prompts to estimate the mask indicating the region of interest on latent codes. Based on the mask, we then apply denoising to the masked latent codes to reveal the editing direction. Our method offers precisely controllable manipulation, allowing users to intuitively customize regions of interest with a text description. Experiments demonstrate the effectiveness and generalization of Face Clan for various pre-trained GANs. It offers an intuitive and widely applicable approach to text-guided face editing that contributes to the landscape of multimedia content creation.

What problem does this paper attempt to address?

The paper addresses the following challenges in 3D face editing:

1. **Balancing quality, efficiency, and generalization**: Previous 3D face editing methods struggle to deliver high-quality edits, fast editing, and wide applicability at the same time. The authors point out that supervised methods require large amounts of labeled data and are time-consuming, while unsupervised methods are identity-sensitive and struggle to find the arbitrary semantic directions that users ask for.
2. **Precise control and identity preservation**: A key issue during editing is how to change only the target area (such as texture or geometric features) without affecting other parts, while keeping the original identity consistent. Existing methods perform poorly on complex attributes (such as hairstyles and hats), especially in color and texture editing.

To solve these problems, the authors propose **Face Clan**, a fast and general text-guided 3D face editing method based on a diffusion model. The method works in the following steps:

- **Introducing the diffusion model**: Apply a diffusion model to the latent space of the GAN to align the distribution of text conditions with that of the latent codes. The diffusion model enhances the diversity and consistency of the text-to-latent-space mapping through deviations accumulated over multiple steps.
- **Estimating the direction mask**: Analyze the difference between the noise predicted under a pair of opposite descriptions (such as "wearing a hat" and "not wearing a hat") to estimate a mask that indicates the region of interest in the latent code. This lets the edit focus on a specific region while preserving the rest.
- **Denoising operation**: Apply the denoising process inside the masked region to reveal the editing direction.

This method allows users to intuitively customize the region of interest through a text description, achieving precisely controllable editing. Experimental results show that Face Clan performs well on a variety of pre-trained GANs and achieves high-quality, efficient, and widely applicable text-guided 3D face editing, which makes it well suited to multimedia content creation.
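
The mask estimation and masked denoising steps described above can be illustrated with a toy NumPy sketch. This is not the authors' implementation: `toy_noise_predictor`, the quantile-based threshold, `keep_ratio`, and the single update step are all assumptions made for illustration, standing in for a real text-conditioned latent diffusion model and its sampling schedule.

```python
import numpy as np

def toy_noise_predictor(w, prompt_sign, sensitive_dims):
    """Hypothetical stand-in for a text-conditioned noise predictor.

    Only `sensitive_dims` react to the prompt (+1 = attribute present,
    -1 = attribute absent); all other dimensions ignore it.
    """
    eps = 0.1 * w
    eps[sensitive_dims] += prompt_sign * 1.0
    return eps

def estimate_direction_mask(eps_pos, eps_neg, keep_ratio=0.2):
    """Mark latent dimensions where the two opposite-prompt predictions differ most.

    Thresholding at a quantile is an assumed rule, not taken from the paper.
    """
    diff = np.abs(eps_pos - eps_neg)
    threshold = np.quantile(diff, 1.0 - keep_ratio)
    return (diff >= threshold).astype(eps_pos.dtype)

def masked_update(w, w_denoised, mask):
    """Use denoised values inside the mask and keep the original code elsewhere."""
    return mask * w_denoised + (1.0 - mask) * w

# Toy 10-dimensional latent code; dimensions 2 and 7 encode the attribute.
w = np.linspace(-1.0, 1.0, 10)
sensitive_dims = [2, 7]

eps_pos = toy_noise_predictor(w, +1, sensitive_dims)  # e.g. "wearing a hat"
eps_neg = toy_noise_predictor(w, -1, sensitive_dims)  # e.g. "not wearing a hat"
mask = estimate_direction_mask(eps_pos, eps_neg, keep_ratio=0.2)

# One simplified denoising step under the target prompt (assumed step size).
step_size = 0.5
w_denoised = w - step_size * eps_pos
w_edited = masked_update(w, w_denoised, mask)

# The revealed editing direction is zero outside the masked region.
direction = w_edited - w
print("mask:", mask)
print("direction:", np.round(direction, 3))
```

In this toy setting the mask selects only the two prompt-sensitive dimensions, so the resulting direction leaves every other latent dimension untouched, which mirrors the goal of editing one attribute while preserving the rest.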

Revealing Directions for Text-guided 3D Face Editing
