Abstract: Audio-driven lip sync has recently drawn significant attention due to its widespread application in the multimedia domain. Individuals exhibit distinct lip shapes when speaking the same utterance, attributed to their unique speaking styles, which poses a notable challenge for audio-driven lip sync. Earlier methods for this task often bypassed the modeling of personalized speaking styles, resulting in sub-optimal lip sync that conforms only to general styles. Recent lip sync techniques attempt to guide the lip sync for arbitrary audio by aggregating information from a style reference video, yet they cannot preserve speaking styles well because their style aggregation is inaccurate. This work proposes an innovative audio-aware style reference scheme that effectively leverages the relationships between the input audio and the reference audio from the style reference video to address style-preserving audio-driven lip sync. Specifically, we first develop an advanced Transformer-based model adept at predicting lip motion corresponding to the input audio, augmented by the style information aggregated through cross-attention layers from the style reference video. Afterwards, to better render the lip motion into realistic talking face video, we devise a conditional latent diffusion model, integrating lip motion through modulated convolutional layers and fusing reference facial images via spatial cross-attention layers. Extensive experiments validate the efficacy of the proposed approach in achieving precise lip sync, preserving speaking styles, and generating high-fidelity, realistic talking face videos.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **how to preserve an individual's speaking style in audio-driven lip sync**. Specifically, existing methods often overlook an individual's unique speaking style when generating lip movements, so the generated lip-sync videos conform only to a general style and cannot accurately reflect the characteristics of a specific individual. This affects the accuracy of the lip sync and also reduces the authenticity and degree of personalization of the generated videos.

### Problem Background

1. **Limitations of Traditional Methods**:
   - Early lip sync methods usually rely on data from specific individuals for training. Although personalized lip sync can be achieved this way, the approach is costly and difficult to generalize to unseen individuals.
   - Existing subject-generic approaches can be applied to any individual without additional training, but they usually ignore the individual's speaking style, so the generated lip-sync videos conform only to a general style and lack personalized features.
2. **Deficiencies of Recent Methods**:
   - Some recent methods attempt to guide lip sync by introducing style reference videos, but they aggregate style information inaccurately and cannot capture and preserve an individual's speaking style.
   - These methods usually use static, video-level style codes. When predicting lip shapes for different audio inputs, they cannot flexibly refer to different lip shapes in the reference, which results in poor style preservation.

### The Method Proposed in the Paper

To solve the above problems, this paper proposes an innovative **audio-aware style reference scheme**, aiming to achieve style-preserving lip sync in the following ways:

1. **Transformer-based Model**:
   - Use the Transformer architecture to predict lip movements while aggregating speaking style information from the style reference video through cross-attention layers.
   - Exploit the relationship between the input audio and the audio in the style reference video to generate more accurate lip movement predictions.
2. **Conditional Latent Diffusion Model**:
   - To render the predicted lip movements into realistic talking-face videos, the researchers designed a conditional latent diffusion model.
   - This model injects the lip motion into the generation process through modulated convolutional layers and fuses reference facial images through spatial cross-attention layers to improve the fidelity and degree of personalization of the generated videos.

### Summary

The core objective of this paper is to overcome the deficiencies of existing methods in style preservation by introducing an audio-aware style reference mechanism, thereby achieving more accurate, personalized, and realistic lip-sync video generation.
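
To make the audio-aware style reference idea more concrete, here is a minimal sketch of a single cross-attention step in plain Python. It is not the paper's implementation: the assignment of input-audio features to queries, reference-audio features to keys, and reference lip motion to values is an assumption made for illustration, as is the function name `audio_aware_style_reference`. A real model would use learned projections, multiple heads, and stacked Transformer layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_aware_style_reference(input_audio, ref_audio, ref_motion):
    """Hypothetical helper (assumed Q/K/V roles, not taken from the paper).

    input_audio: list of feature vectors, one per input audio frame (queries)
    ref_audio:   list of feature vectors, one per reference audio frame (keys)
    ref_motion:  list of lip-motion vectors aligned with ref_audio (values)

    Returns, for each input frame, a weighted average of reference lip motion,
    where the weights come from scaled dot-product similarity between the
    input audio frame and every reference audio frame.
    """
    scale = 1.0 / math.sqrt(len(ref_audio[0]))
    motion_dim = len(ref_motion[0])
    aggregated = []
    for query in input_audio:
        # Similarity of this input frame to each reference audio frame.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in ref_audio]
        weights = softmax(scores)
        # Blend the reference lip motion using the attention weights.
        mixed = [sum(w * frame[j] for w, frame in zip(weights, ref_motion))
                 for j in range(motion_dim)]
        aggregated.append(mixed)
    return aggregated
```

Because the weights are recomputed for every input frame, each frame can draw on different parts of the reference video. This is the contrast with a static, video-level style code, which gives every frame the same style vector regardless of what is being said.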

Style-Preserving Lip Sync via Audio-Aware Style Reference

StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator

ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

Content and Style Aware Audio-Driven Facial Animation

TalkingStyle: Personalized Speech-Driven 3D Facial Animation with Style Preservation

StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation

StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles

StyleLipSync: Style-based Personalized Lip-sync Video Generation

StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads

Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement

Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers

VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization

Say Anything with Any Style

Style Transfer for 2D Talking Head Animation

SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory

VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild

MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting

OpFlowTalker: Realistic and Natural Talking Face Generation via Optical Flow Guidance