Abstract: Face attribute editing plays a pivotal role in various applications. However, existing methods encounter challenges in achieving high-quality results while preserving identity, editing faithfulness, and temporal consistency. These challenges are rooted in issues related to the training pipeline, including limited supervision, architecture design, and optimization strategy. In this work, we introduce S3Editor, a Sparse Semantic-disentangled Self-training framework for face video editing. S3Editor is a generic solution that comprehensively addresses these challenges with three key contributions. Firstly, S3Editor adopts a self-training paradigm to enhance the training process through semi-supervision. Secondly, we propose a semantic disentangled architecture with a dynamic routing mechanism that accommodates diverse editing requirements. Thirdly, we present a structured sparse optimization schema that identifies and deactivates malicious neurons to further disentangle impacts from non-target attributes. S3Editor is model-agnostic and compatible with various editing approaches. Our extensive qualitative and quantitative results affirm that our approach significantly enhances identity preservation, editing fidelity, and temporal consistency.

What problem does this paper attempt to address?

The paper targets three weaknesses of current face video editing methods: poor identity preservation, low editing fidelity, and weak temporal consistency. It traces them to the following deficiencies in existing training pipelines:

1. **Insufficient training supervision**: Paired data is scarce, so face video editing models receive little direct supervision.
2. **Sub-optimal architecture design**: Existing model architectures cannot handle diverse editing requirements.
3. **Ineffective optimization strategies**: Redundant neurons take part in the edit and cause unintended modifications in non-target areas (i.e., over-editing).

To solve these problems, the authors propose a framework named S3Editor, which addresses them through three key contributions:

1. **Self-Training Paradigm**:
   - Adopt a semi-supervised learning method that enhances training by generating pseudo-edited samples, which improves the generalization ability of the model. Specifically, given a facial representation \(x\), randomly select an attribute \(a\) and embed its semantic representation \(a\in\mathbb{R}^d\), then perform the pseudo-editing step:
     \[ \hat{x}\leftarrow T(\text{Denormalize}(\text{Normalize}(x)+\gamma\cdot a)) \]
     where \(\gamma\) is a randomly selected editing intensity parameter and \(T(\cdot)\) is a learnable transformation function. The edited latent representation \(\hat{x}\) is used to generate the edited facial image \(\hat{f}\leftarrow D(\hat{x})\).
   - To ensure the fidelity of the edit, the following loss function is used:
     \[ L_{\text{overall}}:=\lambda_{\text{id}}L_{\text{id}}+\lambda_{\text{faith}}L_{\text{faith}}+\lambda_{\text{gen}}L_{\text{gen}} \]
     where
     \[ L_{\text{id}}:=\|\text{Arcface}(f)-\text{Arcface}(\hat{f})\| \]
     \[ L_{\text{faith}}:=\sum_{a'\in A,a'\neq a}\left(\|[\text{Attr}(f)]_{a'}-[\text{Attr}(\hat{f})]_{a'}\|-\gamma\|[\text{Attr}(f)]_a - [\text{Attr}(\hat{f})]_a\|\right) \]
2. **Semantic Disentangled Editing Architecture**:
   - Propose a dynamic routing mechanism that groups editing requirements into multiple clusters and learns a specific transformation function for each cluster. This handles diverse editing tasks more flexibly than a single shared transformation.
3. **Structured Sparse Optimization Schema**:
   - Avoid over-editing by identifying and deactivating malicious neurons. Specifically, divide the facial latent representation into multiple regions and encourage structured sparsity during training so that only the target regions are modified. The optimization problem can be expressed as: \[ \min_{\theta\in\mathbb{R}^n}L_{\text{overall}}\quad\text{s.t.}\quad
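
The pseudo-editing step and the loss terms of the self-training paradigm can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Several pieces are assumptions made only for illustration: `Normalize`/`Denormalize` are taken to be per-dimension standardization with given `mean` and `std`, the transformation `T` is passed in as a plain callable, attribute scores are one scalar per attribute (so each norm reduces to an absolute value), and the identity embeddings and `L_gen` are supplied by the caller because the summary does not define them.

```python
import numpy as np

def pseudo_edit(x, a, gamma, mean, std, transform):
    """x_hat = T(Denormalize(Normalize(x) + gamma * a)).

    Assumption: Normalize is per-dimension standardization.
    """
    z = (x - mean) / std               # Normalize
    z = z + gamma * a                  # shift by the attribute embedding
    return transform(z * std + mean)   # Denormalize, then apply T

def identity_loss(id_f, id_f_hat):
    """L_id: distance between identity embeddings of f and f_hat."""
    return float(np.linalg.norm(id_f - id_f_hat))

def faithfulness_loss(attr_f, attr_f_hat, target, gamma):
    """L_faith as written in the summary, with scalar attribute scores.

    Sums, over every non-target attribute a', the change in a' minus
    gamma times the change in the target attribute.
    """
    diff = np.abs(attr_f - attr_f_hat)
    others = np.delete(diff, target)   # changes in non-target attributes
    return float(np.sum(others - gamma * diff[target]))

def overall_loss(l_id, l_faith, l_gen, lam_id, lam_faith, lam_gen):
    """Weighted sum of the three loss terms."""
    return lam_id * l_id + lam_faith * l_faith + lam_gen * l_gen
```

In this form, `faithfulness_loss` decreases when the target attribute changes strongly and the remaining attributes stay put, which matches the stated goal of editing only the requested attribute.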

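The routing and sparsity ideas can also be sketched in NumPy, with the caveat that the summary does not specify either mechanism in detail and the constraint of the optimization problem is cut off in the text. The sketch below therefore substitutes generic stand-ins, all of them assumptions: routing is done by nearest centroid in the attribute embedding space, each cluster owns one linear map, and structured sparsity is shown through the standard group soft-thresholding operator (the proximal step of a group-lasso penalty), which zeroes out whole parameter groups whose norm falls below a threshold `tau`.

```python
import numpy as np

def route(attr_embedding, centroids):
    """Return the index of the nearest cluster centroid (Euclidean)."""
    dists = np.linalg.norm(centroids - attr_embedding, axis=1)
    return int(np.argmin(dists))

def routed_transform(z, attr_embedding, centroids, weights):
    """Apply the linear map owned by the cluster the edit is routed to.

    Hypothetical stand-in for per-cluster transformation functions.
    """
    k = route(attr_embedding, centroids)
    return weights[k] @ z

def group_soft_threshold(theta, groups, tau):
    """Proximal operator of tau * sum_g ||theta_g||_2.

    Groups with norm <= tau are set to zero (deactivated); the
    remaining groups are shrunk toward zero.
    """
    out = theta.copy()
    for idx in groups:
        norm = np.linalg.norm(theta[idx])
        scale = max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
        out[idx] = scale * theta[idx]
    return out
```

Thresholding whole groups rather than individual entries is what makes the sparsity structured: a group of neurons is either kept or removed together.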
S3Editor: A Sparse Semantic-Disentangled Self-Training Framework for Face Video Editing
