Abstract: In this work, we explore explicit fine-grained control of generative facial image editing while generating faithful facial appearances and consistent semantic details. This is challenging and has not been extensively explored, especially in a one-shot scenario. We identify the key challenge as disentangling conditional control between high-level semantics and explicit parameters (e.g., 3DMM) in the generation process, and accordingly propose a novel diffusion-based editing framework named DisControlFace. Specifically, we leverage a Diffusion Autoencoder (Diff-AE) as the semantic reconstruction backbone. To enable explicit face editing, we construct an Exp-FaceNet that is compatible with Diff-AE and generates spatial-wise explicit control conditions based on estimated 3DMM parameters. Unlike current diffusion-based editing methods that train the whole conditional generative model from scratch, we freeze the pre-trained weights of the Diff-AE to maintain its semantically deterministic conditioning capability and propose a random semantic masking (RSM) strategy so that Exp-FaceNet can be trained independently. This setting endows the model with disentangled face control while reducing semantic information shift during editing. Our model can be trained on 2D in-the-wild portrait images without requiring 3D or video data, and can perform robust editing on any new facial image through a simple one-shot fine-tuning. Comprehensive experiments demonstrate that DisControlFace generates realistic facial images with better editing accuracy and identity preservation than state-of-the-art methods. Project page: https://discontrolface.github.io/
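The abstract names random semantic masking (RSM) but gives no implementation details. The following is only a minimal sketch of one plausible reading: before an image is passed to the semantic encoder, a random fraction of its patches is zeroed out, so the explicit control branch has to supply the missing information. The function name `random_semantic_mask`, the patch size, the maximum mask ratio, and the choice to mask in pixel space are all assumptions for illustration and are not taken from the paper.

```python
import random


def random_semantic_mask(image, patch=4, max_ratio=0.75, rng=None):
    """Zero out a random subset of patch x patch blocks of a 2D image.

    `image` is a list of equal-length rows of floats. Returns a masked copy
    and the list of (row, col) top-left corners of the masked blocks.
    All parameter defaults are illustrative assumptions.
    """
    if rng is None:
        rng = random.Random()
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")

    # Enumerate the top-left corner of every patch.
    blocks = [(r, c) for r in range(0, height, patch)
              for c in range(0, width, patch)]

    # Draw a mask ratio for this sample, then pick that many patches.
    ratio = rng.uniform(0.0, max_ratio)
    masked = rng.sample(blocks, round(ratio * len(blocks)))

    # Copy the image so the caller's data is left untouched.
    out = [row[:] for row in image]
    for r0, c0 in masked:
        for r in range(r0, r0 + patch):
            for c in range(c0, c0 + patch):
                out[r][c] = 0.0
    return out, masked
```

Drawing a fresh ratio per sample means the encoder sometimes sees almost the full image and sometimes very little of it, which is one simple way to vary how much the control branch has to contribute.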

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper focuses on explicit fine-grained control in facial image editing while maintaining a realistic facial appearance and consistent semantic details. Specifically, the authors propose a new diffusion-based editing framework, DisControlFace, to address the following key issues:

1. **Explicit Fine-Grained Control**: Current methods often fail to separate high-level semantic information from explicit parameter control (such as 3DMM) during facial editing. DisControlFace achieves this separation by combining a Diffusion Autoencoder (Diff-AE) with a specially constructed Exp-FaceNet.
2. **Editing Capability with One-Shot Fine-Tuning**: DisControlFace can perform robust editing on any new facial image after fine-tuning on that single input image, avoiding the need for large amounts of per-subject training data.
3. **Identity Consistency**: Existing methods tend to exhibit identity shifts and other unrelated detail changes during editing. DisControlFace better preserves the identity features of the input image by freezing the pre-trained Diff-AE weights and employing a random semantic masking (RSM) strategy.
4. **Compatibility and Extensibility**: DisControlFace supports not only explicit facial image editing but also other editing tasks such as image inpainting and semantic attribute manipulation.

In summary, the paper aims to achieve explicit control in facial image editing by introducing a new diffusion-based framework that maintains good identity consistency with one-shot fine-tuning.
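The idea of freezing the pre-trained backbone and training only the control branch can be illustrated with a deliberately tiny toy model. This is a sketch of the general freeze-and-train pattern, not the paper's training code: the scalar "backbone" and "control" parameters, the `Param` class, and the squared-error loss are all hypothetical stand-ins for Diff-AE and Exp-FaceNet.

```python
from dataclasses import dataclass


@dataclass
class Param:
    value: float
    trainable: bool  # frozen parameters are never updated


def train_step(backbone, control, x, c, target, lr=0.1):
    """One gradient step on loss = (backbone*x + control*c - target)^2.

    Gradients are computed for both parameters, but only those marked
    trainable are updated. Returns the loss before the update.
    """
    err = backbone.value * x + control.value * c - target
    grads = ((backbone, 2.0 * err * x), (control, 2.0 * err * c))
    for param, grad in grads:
        if param.trainable:
            param.value -= lr * grad
    return err ** 2


# Toy usage: the frozen "backbone" keeps its pre-trained value while the
# "control" parameter learns to explain the remaining error.
backbone = Param(value=1.0, trainable=False)
control = Param(value=0.0, trainable=True)
losses = [train_step(backbone, control, x=1.0, c=1.0, target=3.0)
          for _ in range(20)]
```

Because the backbone never changes, whatever it encoded before training is preserved exactly, and any new behaviour has to come from the control parameter alone.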

DisControlFace: Adding Disentangled Control to Diffusion Autoencoder for One-shot Explicit Facial Image Editing
