HiFiVFS: High Fidelity Video Face Swapping

Xu Chen, Keke He, Junwei Zhu, Yanhao Ge, Wei Li, Chengjie Wang
2024-11-27
Abstract: Face swapping aims to generate results that combine the identity from the source with attributes from the target. Existing methods primarily focus on image-based face swapping. When processing videos, each frame is handled independently, making it difficult to ensure temporal stability. From a model perspective, face swapping is gradually shifting from generative adversarial networks (GANs) to diffusion models (DMs), as DMs have been shown to possess stronger generative capabilities. Current diffusion-based approaches often employ inpainting techniques, which struggle to preserve fine-grained attributes like lighting and makeup. To address these challenges, we propose a high fidelity video face swapping (HiFiVFS) framework, which leverages the strong generative capability and temporal prior of Stable Video Diffusion (SVD). We build a fine-grained attribute module to extract identity-disentangled and fine-grained attribute features through identity desensitization and adversarial learning. Additionally, we introduce detailed identity injection to further enhance identity similarity. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) performance in video face swapping, both qualitatively and quantitatively.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses high-fidelity face swapping in video: generating high-quality, temporally stable results that combine the identity of the source face with the attributes of the target video (such as pose, expression, illumination, and background). Existing methods mainly focus on face swapping in still images. When applied to video, they usually process each frame independently, which makes temporal stability difficult to ensure. Current methods also fall short in handling fine-grained attributes, in controlling identity, and in maintaining high generation quality at the same time. The paper therefore proposes HiFiVFS, a diffusion-based high-fidelity video face-swapping framework, which aims to overcome these challenges and produce high-quality, stable results even in complex scenarios involving extreme poses, facial expressions, lighting conditions, makeup, and occlusions.
Specifically, the main contributions of the paper are:
1. It proposes HiFiVFS, a high-fidelity video face-swapping method that can consistently generate high-fidelity face-swapped videos in extremely challenging scenarios. To the authors' knowledge, this is the first attempt to improve temporal stability within a face-swapping framework.
2. It introduces fine-grained attribute learning (FAL) and detailed identity learning (DIL), which significantly improve control over fine-grained attributes and identity.
3. Extensive experiments show that HiFiVFS outperforms other state-of-the-art face-swapping methods on in-the-wild face videos across a variety of scenarios.
With these improvements, HiFiVFS not only performs well on still images but also delivers more stable, higher-quality output on video.
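The temporal-stability issue raised above can be illustrated with a toy sketch. It is not the HiFiVFS method and not a metric from the paper; all function names are hypothetical, frames are flat lists of pixel intensities, and "jitter" is simply assumed to be the mean absolute difference between consecutive frames. The sketch contrasts a stand-in for independent per-frame processing (a fresh perturbation for every frame) with a stand-in for temporally consistent processing (one perturbation shared by all frames).

```python
import random


def temporal_jitter(frames):
    """Mean absolute difference between consecutive frames.

    Each frame is a flat list of pixel intensities of equal length.
    This is an illustrative measure, not the metric used in the paper.
    """
    if len(frames) < 2:
        return 0.0
    total = 0.0
    for prev, cur in zip(frames, frames[1:]):
        total += sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
    return total / (len(frames) - 1)


def make_static_clip(num_frames=8, num_pixels=16, seed=0):
    """Build a clip in which every frame is identical (no real motion)."""
    rng = random.Random(seed)
    base = [rng.random() for _ in range(num_pixels)]
    return [list(base) for _ in range(num_frames)]


def process_independently(frames, scale=0.05, seed=1):
    """Stand-in for per-frame processing: each frame gets its own,
    uncorrelated perturbation, as if it were edited in isolation."""
    rng = random.Random(seed)
    return [[p + rng.uniform(-scale, scale) for p in f] for f in frames]


def process_consistently(frames, scale=0.05, seed=1):
    """Stand-in for temporally consistent processing: a single
    perturbation is drawn once and reused for every frame."""
    rng = random.Random(seed)
    offset = [rng.uniform(-scale, scale) for _ in frames[0]]
    return [[p + o for p, o in zip(f, offset)] for f in frames]


if __name__ == "__main__":
    clip = make_static_clip()
    print("original:   ", temporal_jitter(clip))
    print("independent:", temporal_jitter(process_independently(clip)))
    print("consistent: ", temporal_jitter(process_consistently(clip)))
```

On a clip with no motion, the independently processed version shows nonzero jitter while the consistently processed version shows none, which is the kind of flicker that per-frame face swapping tends to introduce in video.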