Audio-driven High-resolution Seamless Talking Head Video Editing via StyleGAN

Jiacheng Su, Kunhong Liu, Liyan Chen, Junfeng Yao, Qingsong Liu, Dongdong Lv
2024-07-08
Abstract: Existing methods for audio-driven talking head video editing suffer from poor visual quality. This paper tackles the problem by seamlessly editing talking face images with different emotions, based on two modules: (1) an audio-to-landmark module, consisting of the CrossReconstructed Emotion Disentanglement and an alignment network module, which bridges the gap between speech and facial motions by predicting corresponding emotional landmarks from speech; (2) a landmark-based editing module, which edits face videos via StyleGAN. It aims to generate a seamlessly edited video that reflects the emotion and content components of the input audio. Extensive experiments confirm that, compared with state-of-the-art methods, our method provides high-resolution videos with high visual quality.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the poor visual quality of existing audio-driven talking-head video editing methods. Specifically, existing methods leave obvious editing traces and blurring when generating high-resolution videos. The paper proposes a new framework that uses StyleGAN-based editing to improve video resolution and visual quality, achieve seamless editing, and generate different expressions according to the emotion in the input audio.

### Main Contributions

1. **A new StyleGAN-based framework**: The framework achieves high-resolution, synchronized generation and seamless editing, and generates different expressions according to the emotion embedded in the input audio.
2. **An optimization algorithm**: The algorithm generates face-edited videos through StyleGAN under the supervision of facial landmarks, while preserving the identity of the person in the original video and the temporal smoothness of the video.
3. **An audio-to-landmark module**: The module generates facial landmarks whose emotion and pose are aligned with the target person speaking the input audio. An alignment module based on cross-attention supports this process.

### Method Overview

1. **Audio-to-Landmark Module**:
   - **Emotion Decoupling and Prediction**: Uses cross-reconstructed emotion disentanglement to extract the emotion and content components from the audio, then predicts landmark displacements with an LSTM network and a two-layer MLP.
   - **Alignment**: An alignment network uses cross-attention to align the predicted landmarks with the original landmarks and to prevent facial information leakage.
2. **Landmark-based Editing Module**:
   - **Inversion**: Inverts each video frame into the latent space of the GAN and uses PTI (Pivotal Tuning Inversion) to tune the generator so that it reconstructs the original frame.
   - **Optimization**: Combines several loss functions, including a perceptual loss, a facial landmark loss, and a smoothness loss, to keep the edit synchronized with the audio, preserve identity, and maintain video quality.
   - **Stitching Tuning**: Tunes the generator so that the generated frames blend seamlessly with the original frames, using a segmentation mask and a boundary loss to optimize the alignment and appearance consistency of the edited region.

### Experimental Results

- **Qualitative Comparison**: On the HDTF dataset, compared with methods such as Wav2Lip, VideoReTalking, and StyleHEAT, the videos generated by the proposed method show higher fidelity and better lip sync.
- **Quantitative Comparison**: On the MEAD and HDTF datasets, the proposed method performs well on metrics such as FID, PSNR, SSIM, and LPIPS, especially in terms of image quality and facial movement consistency.

### Conclusion

This paper proposes a new framework that combines a StyleGAN-based optimization algorithm with an audio-to-landmark module to achieve high-resolution seamless editing and emotion-driven facial animation. The experimental results show that the method outperforms existing methods in generating high-quality videos.
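
The optimization step in the method overview combines a facial landmark loss with a smoothness loss. The sketch below illustrates how two such terms might be combined for a short sequence of 2D landmarks. It is only an illustration under stated assumptions: the function names, the squared-distance formulation, the choice to measure smoothness on landmarks rather than on latent codes or pixels, and the weights `w_lmk` and `w_smooth` are all hypothetical and are not taken from the paper. The perceptual loss is omitted because it requires a pretrained network.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Frame = List[Point]  # one frame = a list of 2D landmark coordinates


def landmark_loss(pred: Frame, target: Frame) -> float:
    """Mean squared Euclidean distance between corresponding landmarks."""
    if len(pred) != len(target) or not pred:
        raise ValueError("frames must be non-empty and have equal length")
    total = sum((px - tx) ** 2 + (py - ty) ** 2
                for (px, py), (tx, ty) in zip(pred, target))
    return total / len(pred)


def smoothness_loss(frames: List[Frame]) -> float:
    """Mean squared landmark displacement between consecutive frames.

    Assumption: smoothness is measured on landmarks here; the paper may
    define it on latent codes or generated frames instead.
    """
    if len(frames) < 2:
        return 0.0
    pairs = list(zip(frames, frames[1:]))
    return sum(landmark_loss(cur, prev) for prev, cur in pairs) / len(pairs)


def total_loss(pred_frames: List[Frame], target_frames: List[Frame],
               w_lmk: float = 1.0, w_smooth: float = 0.1) -> float:
    """Weighted sum of the two terms (weights are hypothetical)."""
    if len(pred_frames) != len(target_frames) or not pred_frames:
        raise ValueError("sequences must be non-empty and have equal length")
    lmk = sum(landmark_loss(p, t)
              for p, t in zip(pred_frames, target_frames)) / len(pred_frames)
    return w_lmk * lmk + w_smooth * smoothness_loss(pred_frames)


if __name__ == "__main__":
    target = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (1.0, 1.2)]]
    pred = [[(0.1, 0.0), (1.0, 1.0)], [(0.0, 0.1), (1.0, 1.1)]]
    print(total_loss(pred, target))
```

In a real pipeline these terms would be computed on tensors inside an automatic-differentiation framework so that gradients can flow back to the latent codes or generator weights; the plain-Python version only shows the shape of the objective.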