DisControlFace: Adding Disentangled Control to Diffusion Autoencoder for One-shot Explicit Facial Image Editing

Haozhe Jia, Yan Li, Hengfei Cui, Di Xu, Yuwang Wang, Tao Yu
2024-07-24
Abstract: In this work, we focus on explicit fine-grained control of generative facial image editing while generating faithful facial appearances and consistent semantic details. This is quite challenging and has not been extensively explored, especially in a one-shot scenario. We identify the key challenge as achieving disentangled conditional control between high-level semantics and explicit parameters (e.g., 3DMM) in the generation process, and accordingly propose a novel diffusion-based editing framework named DisControlFace. Specifically, we leverage a Diffusion Autoencoder (Diff-AE) as the semantic reconstruction backbone. To enable explicit face editing, we construct an Exp-FaceNet that is compatible with Diff-AE and generates spatial-wise explicit control conditions based on estimated 3DMM parameters. Unlike current diffusion-based editing methods that train the whole conditional generative model from scratch, we freeze the pre-trained weights of the Diff-AE to maintain its semantically deterministic conditioning capability, and accordingly propose a random semantic masking (RSM) strategy to train Exp-FaceNet independently. This setting endows the model with disentangled face control while reducing semantic information shift during editing. Our model can be trained on 2D in-the-wild portrait images without requiring 3D or video data, and can perform robust editing on any new facial image through a simple one-shot fine-tuning. Comprehensive experiments demonstrate that DisControlFace can generate realistic facial images with better editing accuracy and identity preservation than state-of-the-art methods. Project page: https://discontrolface.github.io/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper focuses on achieving explicit fine-grained control in facial image editing while maintaining a realistic facial appearance and consistent semantic details. Specifically, the authors propose a new diffusion-based editing framework, DisControlFace, to address the following key issues:

1. **Explicit Fine-Grained Control**: Current methods often fail to separate high-level semantic information from explicit parameter control (such as 3DMM) during facial editing. DisControlFace achieves this separation by combining a Diffusion Autoencoder (Diff-AE) with a specially constructed Exp-FaceNet.
2. **One-Shot Editing Capability**: DisControlFace can perform robust editing on any new facial image after fine-tuning on that single input image, so no additional 3D or video data of the subject is required.
3. **Identity Consistency**: Existing methods tend to exhibit identity shifts and other irrelevant detail changes during editing. DisControlFace better preserves the identity of the input image by freezing the pre-trained Diff-AE weights and employing a random semantic masking (RSM) strategy.
4. **Compatibility and Extensibility**: DisControlFace supports not only facial image editing but also other tasks such as image inpainting and semantic attribute manipulation.

In summary, the paper aims to achieve explicit control in facial image editing by introducing a new diffusion-based framework that maintains good identity consistency with one-shot fine-tuning.
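
The text above names a random semantic masking (RSM) strategy but does not say how the masking is carried out. As an illustration only, the sketch below assumes a patch-wise scheme in which a random subset of square image patches is zeroed before the image reaches the semantic encoder. The function name `random_patch_mask` and the parameters `patch_size` and `mask_ratio` are hypothetical and are not taken from the paper.

```python
import numpy as np


def random_patch_mask(image, patch_size=16, mask_ratio=0.5, rng=None):
    """Zero out a random subset of non-overlapping square patches.

    ASSUMPTION: patch-wise masking is an illustrative guess at what
    "random semantic masking" could look like, not the authors' method.

    image: float array of shape (H, W, C); H and W must be divisible
           by patch_size.
    Returns (masked_image, keep), where keep is a boolean grid of shape
    (H // patch_size, W // patch_size) that is True for visible patches.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    grid_h, grid_w = h // patch_size, w // patch_size
    n_patches = grid_h * grid_w
    n_masked = int(round(mask_ratio * n_patches))

    # Pick which patches to hide, without replacement.
    keep = np.ones(n_patches, dtype=bool)
    keep[rng.choice(n_patches, size=n_masked, replace=False)] = False
    keep = keep.reshape(grid_h, grid_w)

    # Expand the patch grid to pixel resolution and apply it to all channels.
    pixel_keep = np.repeat(np.repeat(keep, patch_size, axis=0), patch_size, axis=1)
    masked = image * pixel_keep[..., None]
    return masked, keep


if __name__ == "__main__":
    demo = np.ones((64, 64, 3), dtype=np.float32)
    masked, keep = random_patch_mask(demo, patch_size=16, mask_ratio=0.5,
                                     rng=np.random.default_rng(0))
    print("visible patches:", int(keep.sum()), "of", keep.size)
```

Under this assumption, a training loop would pass `masked` to the frozen semantic encoder and update only the control branch, so that the control branch has to supply the information the mask removed. Whether the authors mask pixels, patches, or latent features is not stated in the text above.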