Abstract: Existing methods for audio-driven talking head video editing suffer from poor visual quality. This paper tackles the problem by seamlessly editing talking face images with different emotions, based on two modules: (1) an audio-to-landmark module, consisting of Cross-Reconstructed Emotion Disentanglement and an alignment network, which bridges the gap between speech and facial motion by predicting the corresponding emotional landmarks from speech; and (2) a landmark-based editing module, which edits face videos via StyleGAN and aims to generate a seamlessly edited video that reflects the emotion and content components of the input audio. Extensive experiments confirm that, compared with state-of-the-art methods, our method provides high-resolution videos with high visual quality.

What problem does this paper attempt to address?

The paper addresses the poor visual quality of existing audio-driven talking-head video editing methods. Specifically, existing methods leave obvious editing traces and blur when generating high-resolution videos. The paper proposes a new framework that uses StyleGAN-based editing to improve video resolution and visual quality, achieve seamless editing, and generate different expressions according to the emotions in the input audio.

### Main Contributions

1. **Propose a new framework based on StyleGAN**: The framework achieves high-resolution synchronous generation and seamless editing, and generates different expressions according to the emotions embedded in the input audio.
2. **Introduce an optimization algorithm**: The algorithm generates facial-edited videos through StyleGAN under the supervision of facial landmarks, while preserving the identity of the person in the original video and the smoothness of the video.
3. **Develop an audio-to-landmark module**: The module generates facial landmarks that are aligned in emotion and pose with the target person speaking the input audio. An alignment module based on cross-attention supports this process.

### Method Overview

1. **Audio-to-Landmark Module**:
   - **Emotion Disentanglement and Prediction**: Use cross-reconstructed emotion disentanglement to extract the emotion and content components of the audio, then predict landmark displacements with an LSTM network followed by a two-layer MLP.
   - **Alignment**: Use an alignment network with cross-attention to align the predicted landmarks with the original landmarks and prevent facial information leakage.
2. **Landmark-based Editing Module**:
   - **Inversion**: Invert each video frame into the latent space of the GAN, and use PTI (Pivotal Tuning Inversion) to tune the generator so that it reconstructs the original frame.
   - **Optimization**: Combine several loss functions, including a perceptual loss, a facial landmark loss, and a smoothness loss, to keep the edit synchronized, preserve identity, and maintain video quality.
   - **Stitching Tuning**: Tune the generator so that the generated frames blend seamlessly with the original frames, using a segmentation mask and a boundary loss to optimize the alignment and appearance consistency of the edited region.

### Experimental Results

- **Qualitative Comparison**: On the HDTF dataset, compared with methods such as Wav2Lip, VideoReTalking, and StyleHEAT, the videos generated by the proposed method have higher fidelity and better lip synchronization.
- **Quantitative Comparison**: On the MEAD and HDTF datasets, the proposed method performs well on metrics such as FID, PSNR, SSIM, and LPIPS, especially in terms of image quality and facial movement consistency.

### Conclusion

The paper proposes a new framework that combines a StyleGAN-based optimization algorithm with an audio-to-landmark module to achieve high-resolution seamless editing and emotion-driven facial animation. The experimental results show that the method outperforms existing methods in generating high-quality videos.
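
The Optimization step in the Method Overview combines a facial landmark loss with a smoothness loss. The summary does not give the exact formulas, so the following is only a minimal sketch under stated assumptions: the landmark loss is taken to be the mean squared distance between matched 2D landmarks, the smoothness loss is taken to be the mean squared difference between consecutive per-frame latent codes, and the function names and weights (`w_landmark`, `w_smooth`) are hypothetical. The perceptual loss is left out because it needs a pretrained feature network.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def landmark_loss(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """Mean squared Euclidean distance between matched landmarks (assumed form)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("landmark sets must be non-empty and the same length")
    total = 0.0
    for (px, py), (tx, ty) in zip(pred, target):
        total += (px - tx) ** 2 + (py - ty) ** 2
    return total / len(pred)


def smoothness_loss(latents: Sequence[Sequence[float]]) -> float:
    """Mean squared change between consecutive per-frame latent codes (assumed form)."""
    if len(latents) < 2:
        return 0.0  # a single frame has no temporal change to penalize
    total = 0.0
    for prev, curr in zip(latents, latents[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(latents) - 1)


def total_loss(
    pred_landmarks: Sequence[Sequence[Point]],
    target_landmarks: Sequence[Sequence[Point]],
    latents: Sequence[Sequence[float]],
    w_landmark: float = 1.0,  # hypothetical weight, not from the paper
    w_smooth: float = 0.1,    # hypothetical weight, not from the paper
) -> float:
    """Weighted sum of the per-frame landmark loss and the temporal smoothness loss."""
    per_frame = [
        landmark_loss(p, t) for p, t in zip(pred_landmarks, target_landmarks)
    ]
    mean_landmark = sum(per_frame) / len(per_frame)
    return w_landmark * mean_landmark + w_smooth * smoothness_loss(latents)
```

In an actual pipeline, the predicted landmarks would be detected on the frames produced by the generator, and both terms would be differentiated with respect to the latent codes. The plain-Python version here only illustrates how the two terms trade off synchronization against temporal stability.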

Audio-driven High-resolution Seamless Talking Head Video Editing via StyleGAN
