Abstract: Audio-driven talking head generation is a pivotal area within film-making and Virtual Reality. Although existing methods have made significant strides following the end-to-end paradigm, they still encounter challenges in producing videos with high-frequency details due to their limited expressivity in this domain. This limitation has prompted us to explore an effective post-processing approach to synthesize photo-realistic talking head videos. Specifically, we employ a pretrained Wav2Lip model as our foundation model, leveraging its robust audio-lip alignment capabilities. Drawing on the theory of Lipschitz Continuity, we have theoretically established the noise robustness of Vector Quantised Auto Encoders (VQAEs). Our experiments further demonstrate that the high-frequency texture deficiency of the foundation model can be temporally consistently recovered by the Space-Optimised Vector Quantised Auto Encoder (SOVQAE) we introduced, thereby facilitating the creation of realistic talking head videos. We conduct experiments on both the conventional dataset and the High-Frequency TalKing head (HFTK) dataset that we curated. The results indicate that our method, LaDTalk, achieves new state-of-the-art video quality and out-of-domain lip synchronization performance.
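
For readers unfamiliar with the term, the standard definition of Lipschitz continuity can be written as below. This is the textbook definition applied to a generic decoder, not the paper's own theorem; the symbols $D$, $K$, $z$, and $\epsilon$ are notation assumed here for illustration.

```latex
% Standard definition: f is K-Lipschitz if, for all x and y,
\| f(x) - f(y) \| \le K \, \| x - y \|

% Applied to a decoder D (assumed notation) with latent z and latent noise \epsilon:
\| D(z + \epsilon) - D(z) \| \le K \, \| \epsilon \|
```

Read this way, a finite Lipschitz constant $K$ bounds how far a latent perturbation can move the decoded output, which is the kind of property a noise-robustness argument can build on.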

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses the lack of high-frequency detail in audio-driven talking head video generation. Although existing end-to-end methods have made significant progress in lip synchronization, they still struggle to generate high-frequency details such as hair, facial wrinkles, moles, eyelashes, and lip contours, mainly because of their limited expressivity for such details.

Specifically, the paper proposes a new method called LaDTalk, which applies a Space-Optimised Vector Quantised Auto Encoder (SOVQAE) as a post-processing step on top of a pretrained Wav2Lip model to restore high-frequency texture details. LaDTalk improves video quality and also performs well on out-of-domain lip synchronization.

### Main Contributions

1. **Introduction of a new HFTK dataset**: Specifically designed for evaluating high-frequency detail generation in talking head videos.
2. **Exploration of VQAE's noise robustness**: Establishes the noise robustness of VQAEs in latent space using Lipschitz continuity theory.
3. **Development of SOVQAE**: Supports its design principles with theoretical analysis, improving the ability to generate high-quality talking head videos.
4. **Experimental validation**: Extensive experiments show that LaDTalk outperforms existing state-of-the-art methods across various metrics.

### Solution

1. **Wav2Lip as the base model**: Uses its strong audio-lip alignment capability to generate low-resolution lip-synced videos.
2. **SOVQAE for post-processing**: Restores high-frequency texture details by denoising in latent space, producing high-quality videos.
3. **Theoretical analysis**: Establishes the noise robustness of VQAEs in latent space using Lipschitz continuity theory and proposes a codebook regularization loss to enhance this robustness.
4. **Experimental evaluation**: Comprehensive experiments on public datasets and the self-built HFTK dataset validate LaDTalk's advantages in video quality and lip synchronization accuracy.

### Conclusion

With the LaDTalk framework, the paper addresses the limitations of existing methods in generating talking head videos rich in high-frequency detail, offering a new post-processing approach for audio-driven talking head video generation.
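
The intuition behind noise robustness in a quantised latent space can be illustrated with a toy example. The sketch below uses plain nearest-neighbour vector quantisation, not the paper's SOVQAE or its codebook regularization loss; the codebook values and the helper names `quantize` and `safe_noise_radius` are hypothetical. It relies only on the triangle inequality: if the noise norm is below half the gap between the nearest and second-nearest code distances, the selected code cannot change.

```python
import numpy as np

def quantize(z, codebook):
    """Return the index and value of the codebook entry nearest to z."""
    dists = np.linalg.norm(codebook - z, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

def safe_noise_radius(z, codebook):
    """Half the gap between the nearest and second-nearest code distances.

    Any perturbation with a norm strictly below this value leaves the
    nearest code unchanged (triangle inequality).
    """
    dists = np.sort(np.linalg.norm(codebook - z, axis=1))
    return 0.5 * (dists[1] - dists[0])

# Toy 2-D codebook and latent vector (illustrative values only).
codebook = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
z_clean = np.array([0.5, 0.2])

clean_idx, _ = quantize(z_clean, codebook)
radius = safe_noise_radius(z_clean, codebook)

# Perturb the latent in random directions, staying just inside the safe radius.
rng = np.random.default_rng(0)
unchanged = True
for _ in range(1000):
    direction = rng.normal(size=2)
    noise = 0.99 * radius * direction / np.linalg.norm(direction)
    noisy_idx, _ = quantize(z_clean + noise, codebook)
    unchanged = unchanged and (noisy_idx == clean_idx)

print("selected code:", clean_idx, "| stable under bounded noise:", unchanged)
```

In this toy setting, widely separated codebook entries give a larger safe radius, which is one plausible reading of why a regulariser on the codebook could improve robustness; the paper's actual loss is not reproduced here.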

LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details
