Abstract: Face swapping aims to generate results that combine the identity from the source with attributes from the target. Existing methods primarily focus on image-based face swapping. When processing videos, each frame is handled independently, making it difficult to ensure temporal stability. From a model perspective, face swapping is gradually shifting from generative adversarial networks (GANs) to diffusion models (DMs), as DMs have been shown to possess stronger generative capabilities. Current diffusion-based approaches often employ inpainting techniques, which struggle to preserve fine-grained attributes like lighting and makeup. To address these challenges, we propose a high fidelity video face swapping (HiFiVFS) framework, which leverages the strong generative capability and temporal prior of Stable Video Diffusion (SVD). We build a fine-grained attribute module to extract identity-disentangled and fine-grained attribute features through identity desensitization and adversarial learning. Additionally, we introduce detailed identity injection to further enhance identity similarity. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) results in video face swapping, both qualitatively and quantitatively.

What problem does this paper attempt to address?

The paper addresses high-fidelity face swapping in videos: combining the identity of the source face with the attributes of the target video (such as pose, expression, illumination, and background) to generate high-quality, temporally stable results. Existing methods mainly target static images. When applied to video, they usually process each frame independently, which makes temporal stability hard to ensure. Current methods also fall short in handling fine-grained attributes, controlling identity, and maintaining high generation quality at the same time.

To overcome these limitations, the paper proposes HiFiVFS, a diffusion-based high-fidelity video face swapping framework that targets stable, high-quality results in difficult scenarios such as extreme poses, facial expressions, lighting conditions, makeup, and occlusions. The main contributions are:

1. A high-fidelity video face swapping method, HiFiVFS, which can consistently generate high-fidelity face-swapped videos in extremely challenging scenarios. To the authors' knowledge, this is the first attempt to improve temporal stability within the face swapping framework.
2. Fine-grained attribute learning (FAL) and detailed identity learning (DIL), which significantly improve control over fine-grained attributes and identity.
3. Extensive experiments showing that HiFiVFS outperforms other state-of-the-art face swapping methods on in-the-wild face videos across a variety of scenarios.

With these improvements, HiFiVFS is reported to perform well on static images and to produce more stable, higher-quality output on video.
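The two properties emphasized above, identity similarity and temporal stability, can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the evaluation protocol used in the paper: the function names `cosine_similarity` and `temporal_jitter` are hypothetical, identity embeddings are assumed to come from some face recognition model that is not shown, and frames are assumed to be flattened lists of pixel intensities.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two identity embeddings (hypothetical inputs)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def temporal_jitter(frames: Sequence[Sequence[float]]) -> float:
    """Mean absolute per-pixel change between consecutive frames.

    A crude proxy for flicker: it also counts genuine motion, so it is
    only meaningful when comparing outputs generated from the same target video.
    """
    if len(frames) < 2:
        return 0.0
    total = 0.0
    count = 0
    for prev, curr in zip(frames, frames[1:]):
        total += sum(abs(c - p) for p, c in zip(prev, curr))
        count += len(curr)
    return total / count


if __name__ == "__main__":
    # Toy data: two-dimensional "embeddings" and two-pixel "frames".
    source_id = [1.0, 0.0]
    swapped_id = [0.9, 0.1]
    print("identity similarity:", cosine_similarity(source_id, swapped_id))

    steady = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
    flickering = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
    print("jitter (steady):", temporal_jitter(steady))
    print("jitter (flickering):", temporal_jitter(flickering))
```

A method that swaps each frame independently can score well on the first quantity while scoring poorly on the second, which is the gap the summary describes.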

HiFiVFS: High Fidelity Video Face Swapping
