LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details

Jian Yang, Xukun Wang, Wentao Wang, Guoming Li, Qihang Fang, Ruihong Yuan, Tianyang Wang, Jason Zhaoxin Fan
2024-10-02
Abstract: Audio-driven talking head generation is a pivotal area within film-making and Virtual Reality. Although existing methods have made significant strides following the end-to-end paradigm, they still encounter challenges in producing videos with high-frequency details due to their limited expressivity in this domain. This limitation has prompted us to explore an effective post-processing approach to synthesize photo-realistic talking head videos. Specifically, we employ a pretrained Wav2Lip model as our foundation model, leveraging its robust audio-lip alignment capabilities. Drawing on the theory of Lipschitz Continuity, we have theoretically established the noise robustness of Vector Quantised Auto Encoders (VQAEs). Our experiments further demonstrate that the high-frequency texture deficiency of the foundation model can be temporally consistently recovered by the Space-Optimised Vector Quantised Auto Encoder (SOVQAE) we introduced, thereby facilitating the creation of realistic talking head videos. We conduct experiments on both the conventional dataset and the High-Frequency TalKing head (HFTK) dataset that we curated. The results indicate that our method, LaDTalk, achieves new state-of-the-art video quality and out-of-domain lip synchronization performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the lack of high-frequency detail in audio-driven talking head video generation. Although existing end-to-end methods have made significant progress in lip synchronization, they still struggle to generate high-frequency details such as hair, facial wrinkles, moles, eyelashes, and lip contours, because of their limited expressivity in this domain.

The paper proposes LaDTalk, which applies a Space-Optimised Vector Quantised Auto Encoder (SOVQAE) as a post-processing stage on the output of a pretrained Wav2Lip model to restore high-frequency texture details. According to the paper, LaDTalk improves video quality and also achieves state-of-the-art out-of-domain lip synchronization.

### Main Contributions

1. **Introduction of a new HFTK dataset**: The High-Frequency TalKing head dataset is designed for evaluating high-frequency detail generation in talking head videos.
2. **Exploration of VQAE's noise robustness**: The noise robustness of VQAEs in latent space is established theoretically using Lipschitz continuity.
3. **Development of SOVQAE**: Its design principles are supported by theoretical analysis, and it improves the ability to generate high-quality talking head videos.
4. **Experimental validation**: Extensive experiments show that LaDTalk outperforms existing state-of-the-art methods across various metrics.

### Solution

1. **Wav2Lip as the base model**: Its strong audio-lip alignment capability is used to generate low-resolution lip-synced videos.
2. **SOVQAE for post-processing**: High-frequency texture details are restored by denoising in latent space, producing high-quality videos.
3. **Theoretical analysis**: Building on the Lipschitz continuity result for VQAEs, the paper proposes a codebook regularization loss to enhance noise robustness in latent space.
4. **Experimental evaluation**: Comprehensive experiments on public datasets and the author-curated HFTK dataset support LaDTalk's advantage in video quality and lip synchronization accuracy.

### Conclusion

With the LaDTalk framework, the paper addresses the limitations of existing methods in generating talking head videos rich in high-frequency detail, and offers a post-processing solution for audio-driven talking head video generation.
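
The latent-space noise robustness mentioned in the Solution section can be illustrated with a toy example. The sketch below is not the paper's SOVQAE or its Lipschitz-based proof. It only shows a basic property of nearest-neighbour vector quantization: if a latent vector is perturbed by less than half the gap between its nearest and second-nearest codebook distances, the selected code does not change. The codebook values, the latent vector, and all function names (`nearest_code`, `stability_margin`, `perturb`) are hypothetical and chosen for illustration.

```python
import math
import random


def nearest_code(z, codebook):
    """Return the index of the codebook vector closest to z (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: math.dist(z, codebook[i]))


def stability_margin(z, codebook):
    """Half the gap between the second-nearest and nearest code distances.

    By the triangle inequality, any perturbation with norm strictly below
    this value cannot change which code is nearest.
    """
    dists = sorted(math.dist(z, c) for c in codebook)
    return (dists[1] - dists[0]) / 2.0


def perturb(z, radius, rng):
    """Add a random offset of Euclidean norm `radius` to z."""
    direction = [rng.gauss(0.0, 1.0) for _ in z]
    norm = math.sqrt(sum(d * d for d in direction))
    return [zi + radius * d / norm for zi, d in zip(z, direction)]


# Toy 2-D codebook and latent vector (illustrative values only).
codebook = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]]
z = [0.5, 0.3]

rng = random.Random(0)
clean_index = nearest_code(z, codebook)
margin = stability_margin(z, codebook)

# Quantize 1000 noisy copies of z, each perturbed by slightly less than
# the margin, and check that they all select the same code as clean z.
noisy_indices = [
    nearest_code(perturb(z, 0.99 * margin, rng), codebook) for _ in range(1000)
]
all_stable = all(i == clean_index for i in noisy_indices)
print(clean_index, all_stable)
```

In this toy setting, spreading codebook entries further apart enlarges the margin, which gives one intuition for why a regularization term on the codebook could help robustness. The actual form of the paper's codebook regularization loss is not given in this summary.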