Style-Preserving Lip Sync via Audio-Aware Style Reference

Weizhi Zhong, Jichang Li, Yinqi Cai, Liang Lin, Guanbin Li
2024-08-10
Abstract: Audio-driven lip sync has recently drawn significant attention due to its widespread application in the multimedia domain. Individuals exhibit distinct lip shapes when speaking the same utterance, attributed to the unique speaking styles of individuals, posing a notable challenge for audio-driven lip sync. Earlier methods for this task often bypassed the modeling of personalized speaking styles, resulting in sub-optimal lip sync conforming to the general styles. Recent lip sync techniques attempt to guide the lip sync for arbitrary audio by aggregating information from a style reference video, yet they cannot preserve the speaking styles well due to their inaccuracy in style aggregation. This work proposes an innovative audio-aware style reference scheme that effectively leverages the relationships between input audio and reference audio from style reference video to address the style-preserving audio-driven lip sync. Specifically, we first develop an advanced Transformer-based model adept at predicting lip motion corresponding to the input audio, augmented by the style information aggregated through cross-attention layers from style reference video. Afterwards, to better render the lip motion into realistic talking face video, we devise a conditional latent diffusion model, integrating lip motion through modulated convolutional layers and fusing reference facial images via spatial cross-attention layers. Extensive experiments validate the efficacy of the proposed approach in achieving precise lip sync, preserving speaking styles, and generating high-fidelity, realistic talking face videos.
Computer Vision and Pattern Recognition, Artificial Intelligence, Multimedia
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **how to preserve an individual's speaking style in audio-driven lip sync**. Specifically, existing methods often overlook an individual's unique speaking style when generating lip movements, so the generated videos conform only to a general style and fail to reflect the characteristics of a specific individual. This not only affects lip-sync accuracy but also reduces the authenticity and personalization of the generated videos.

### Problem Background

1. **Limitations of Traditional Methods**:
   - Early lip-sync methods usually rely on data from specific individuals for training. Although this achieves personalized lip sync, it is costly and difficult to generalize to unseen individuals.
   - Existing subject-generic approaches can be applied to any individual without additional training, but they usually ignore the individual's speaking style, so the generated videos conform only to a general style and lack personalized features.
2. **Deficiencies of Recent Methods**:
   - Some recent methods attempt to guide lip sync with a style reference video, but they aggregate style information inaccurately and therefore cannot capture and preserve an individual's speaking style.
   - These methods usually use a static, video-level style code. When predicting lip shapes for different audio inputs, they cannot flexibly refer to different lip shapes in the reference, which leads to poor style preservation.

### The Method Proposed in the Paper

To solve the above problems, this paper proposes an **audio-aware style reference scheme** that aims to achieve style-preserving lip sync in the following ways:

1. **Transformer-based Model**:
   - Use the Transformer architecture to predict lip motion for the input audio while aggregating speaking-style information from the style reference video through cross-attention layers.
   - Exploit the relationship between the input audio and the reference audio from the style reference video to generate more accurate lip motion predictions.
2. **Conditional Latent Diffusion Model**:
   - To render the predicted lip motion into realistic talking-face video, the researchers designed a conditional latent diffusion model.
   - This model integrates lip motion into the generation process through modulated convolutional layers and fuses reference facial images through spatial cross-attention layers, improving the fidelity and personalization of the generated videos.

### Summary

The core objective of this paper is to overcome the deficiencies of existing methods in style preservation by introducing an audio-aware style reference mechanism, thereby achieving more accurate, personalized, and realistic lip-sync video generation.
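
The audio-aware style reference idea described under "The Method Proposed in the Paper" can be illustrated with a single cross-attention step. The sketch below is not the paper's implementation: it assumes one attention head, no learned projections, and toy two-dimensional features, and the names `audio_aware_style_attention`, `input_audio`, `ref_audio`, and `ref_lip_motion` are hypothetical. It only shows the principle that each input audio frame attends to reference audio frames and retrieves the lip motion that the speaker produced at the best-matching moments, instead of relying on one static video-level style code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_aware_style_attention(input_audio, ref_audio, ref_lip_motion):
    """Aggregate reference lip motion for each input audio frame.

    input_audio:    list of T_in audio feature vectors (used as queries)
    ref_audio:      list of T_ref audio feature vectors (used as keys)
    ref_lip_motion: list of T_ref lip-motion vectors (used as values)
    Returns (style_features, attention_weights).
    """
    if len(ref_audio) != len(ref_lip_motion):
        raise ValueError("reference audio and lip motion must be frame-aligned")
    scale = math.sqrt(len(input_audio[0]))
    motion_dim = len(ref_lip_motion[0])
    style_features, attention = [], []
    for query in input_audio:
        # Scaled dot-product similarity between input and reference audio.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale for key in ref_audio
        ]
        weights = softmax(scores)
        # Weighted sum of the reference lip motion (the style information).
        mixed = [
            sum(w * frame[d] for w, frame in zip(weights, ref_lip_motion))
            for d in range(motion_dim)
        ]
        style_features.append(mixed)
        attention.append(weights)
    return style_features, attention


# Toy data; every value here is made up for illustration.
ref_audio = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
ref_lip_motion = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
input_audio = [[0.0, 4.0], [4.0, 0.0]]

style, weights = audio_aware_style_attention(input_audio, ref_audio, ref_lip_motion)
for t, (w, s) in enumerate(zip(weights, style)):
    print(t, [round(x, 3) for x in w], [round(x, 3) for x in s])
```

In this toy setup the first input frame resembles the second reference audio frame, so its aggregated style feature is dominated by the second reference lip-motion vector. In a full model, a step like this would presumably sit inside the Transformer's cross-attention layers with learned query, key, and value projections.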