S3Editor: A Sparse Semantic-Disentangled Self-Training Framework for Face Video Editing

Guangzhi Wang, Tianyi Chen, Kamran Ghasedi, HsiangTao Wu, Tianyu Ding, Chris Nuesmeyer, Ilya Zharkov, Mohan Kankanhalli, Luming Liang
2024-04-12
Abstract: Face attribute editing plays a pivotal role in various applications. However, existing methods encounter challenges in achieving high-quality results while preserving identity, editing faithfulness, and temporal consistency. These challenges are rooted in issues related to the training pipeline, including limited supervision, architecture design, and optimization strategy. In this work, we introduce S3Editor, a Sparse Semantic-disentangled Self-training framework for face video editing. S3Editor is a generic solution that comprehensively addresses these challenges with three key contributions. Firstly, S3Editor adopts a self-training paradigm to enhance the training process through semi-supervision. Secondly, we propose a semantic disentangled architecture with a dynamic routing mechanism that accommodates diverse editing requirements. Thirdly, we present a structured sparse optimization schema that identifies and deactivates malicious neurons to further disentangle impacts from non-target attributes. S3Editor is model-agnostic and compatible with various editing approaches. Our extensive qualitative and quantitative results affirm that our approach significantly enhances identity preservation, editing fidelity, and temporal consistency.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The main problems that this paper attempts to solve are several key challenges faced by current facial video editing methods in generating high-quality results: identity preservation, editing fidelity, and temporal consistency. Specifically, existing methods have the following deficiencies:

1. **Insufficient training supervision**: Due to the lack of paired data, there is insufficient supervision for facial video editing.
2. **Sub-optimal architecture design**: Existing model architectures are not sufficient to handle diverse editing requirements.
3. **Ineffective optimization strategies**: The excessive involvement of redundant neurons leads to unexpected modifications in non-target areas (i.e., over-editing).

To solve these problems, the authors propose a new framework named S3Editor, which comprehensively addresses these challenges through the following three key contributions:

1. **Self-Training Paradigm**:
   - Adopt a semi-supervised learning method to enhance the training process by generating pseudo-edited samples, thereby improving the generalization ability of the model. Specifically, given a facial representation \(x\), randomly select an attribute \(a\) and embed its semantic representation \(a\in\mathbb{R}^d\), and then perform the pseudo-editing step:

   \[ \hat{x}\leftarrow T(\text{Denormalize}(\text{Normalize}(x)+\gamma\cdot a)) \]

   where \(\gamma\) is a randomly selected editing intensity parameter, and \(T(\cdot)\) is a learnable transformation function. The edited latent representation \(\hat{x}\) is used to generate the edited facial image \(\hat{f}\leftarrow D(\hat{x})\).
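The pseudo-editing step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the summary does not specify how Normalize/Denormalize are defined, so per-dimension standardization with hypothetical statistics `mean` and `std` is assumed here; the learnable transformation \(T\) is stood in for by a linear map initialised to the identity; and the names `pseudo_edit`, `attr_dirs`, and `gamma_range` are made up for the example. The decoder \(D\) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # latent dimensionality (illustrative value)

# ASSUMPTION: Normalize/Denormalize are per-dimension standardization
# with fixed latent statistics; the source does not define them.
mean = rng.normal(size=d)
std = np.abs(rng.normal(size=d)) + 0.5

def normalize(x):
    return (x - mean) / std

def denormalize(z):
    return z * std + mean

# ASSUMPTION: stand-in for the learnable transformation T,
# a linear layer initialised to the identity map.
W = np.eye(d)
b = np.zeros(d)

def T(x):
    return x @ W.T + b

def pseudo_edit(x, attr_dirs, rng, gamma_range=(-1.0, 1.0)):
    """Pick a random attribute and intensity, then apply
    x_hat = T(Denormalize(Normalize(x) + gamma * a))."""
    idx = rng.integers(len(attr_dirs))   # randomly selected attribute
    a = attr_dirs[idx]                   # its embedding in R^d
    gamma = rng.uniform(*gamma_range)    # random editing intensity
    x_hat = T(denormalize(normalize(x) + gamma * a))
    return x_hat, idx, gamma

# Hypothetical attribute embeddings (unit-norm directions) and one latent.
attr_dirs = rng.normal(size=(4, d))
attr_dirs /= np.linalg.norm(attr_dirs, axis=1, keepdims=True)
x = rng.normal(size=d)

x_hat, idx, gamma = pseudo_edit(x, attr_dirs, rng)
```

Under these assumptions, with \(T\) still at its identity initialisation, the edit reduces to shifting the latent by \(\gamma\cdot a\) rescaled by `std`, which makes the role of the intensity parameter easy to inspect before any training of \(T\).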
To ensure the fidelity of the edit, the following loss function is designed:

\[ L_{\text{overall}}:=\lambda_{\text{id}}L_{\text{id}}+\lambda_{\text{faith}}L_{\text{faith}}+\lambda_{\text{gen}}L_{\text{gen}} \]

where

\[ L_{\text{id}}:=\|\text{Arcface}(f)-\text{Arcface}(\hat{f})\| \]

\[ L_{\text{faith}}:=\sum_{a'\in A,a'\neq a}\left(\|[\text{Attr}(f)]_{a'}-[\text{Attr}(\hat{f})]_{a'}\|-\gamma\|[\text{Attr}(f)]_a - [\text{Attr}(\hat{f})]_a\|\right) \]

2. **Semantic Disentangled Editing Architecture**:
   - Propose a dynamic routing mechanism to classify different editing requirements into multiple clusters and learn specific transformation functions for each cluster. This method can handle diverse editing tasks more flexibly and avoid the limitations of a single transformation.
3. **Structured Sparse Optimization Schema**:
   - Avoid over-editing by identifying and deactivating malicious neurons. Specifically, divide the facial latent representation into multiple regions and encourage structural sparsity during the training process to ensure that only the target regions are modified. The optimization problem can be expressed as:

   \[ \min_{\theta\in\mathbb{R}^n}L_{\text{overall}}\quad\text{s.t.}\quad