ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu

2024-08-07

Abstract: Lip-syncing videos with given audio is the foundation for various applications including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-orientated models either require long-term videos for clip-specific training or retain visible artifacts. In this paper, we propose a unified and effective framework ReSyncer, that synchronizes generalized audio-visual facial information. The key design is revisiting and rewiring the Style-based generator to efficiently adopt 3D facial dynamics predicted by a principled style-injected Transformer. By simply re-configuring the information insertion mechanisms within the noise and style space, our framework fuses motion and appearance with unified training. Extensive experiments demonstrate that ReSyncer not only produces high-fidelity lip-synced videos according to audio, but also supports multiple appealing properties that are suitable for creating virtual presenters and performers, including fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping. Resources can be found at https://guanjz20.github.io/projects/ReSyncer.

Computer Vision and Pattern Recognition, Graphics, Multimedia

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the generation of high-fidelity lip-synced videos in support of creating virtual presenters and performers. Specifically:

1. **High-Fidelity Lip-Sync**: Existing methods either require long videos for clip-specific training or leave visible artifacts in the generated results. The paper proposes a unified framework, `ReSyncer`, which rewires a style-based generator to adopt 3D facial dynamics predicted by a style-injected Transformer, fusing motion and appearance information.
2. **Multiple Application Scenarios**: Beyond lip-syncing from audio, `ReSyncer` supports fast personalized fine-tuning, video-driven lip-sync, speaking style transfer, and even face swapping, all of which are useful for creating virtual presenters and performers.
3. **Unified Model**: A single model covers lip-sync, speaking style transfer, and face swapping, rather than requiring a separate task-specific model for each.

Together, these changes improve the quality of lip-synced videos and broaden the framework's use in creating virtual characters.
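
The two "insertion mechanisms" mentioned in the abstract, the style space and the noise space, can be illustrated with a toy layer. This is a minimal sketch of a generic style-based generator layer, not ReSyncer's actual architecture: the function names (`modulate`, `inject_spatial`, `toy_layer`), the plain-list tensors, and the choice to send an appearance code through the style path and a motion map through the noise path are all assumptions made for illustration.

```python
# Toy illustration of the two injection paths in a style-based generator layer.
# All names and the routing of appearance vs. motion are hypothetical.

def modulate(features, style):
    """Style path: scale each channel by (1 + s), one scalar s per channel."""
    return [[[v * (1.0 + s) for v in row] for row in channel]
            for channel, s in zip(features, style)]

def inject_spatial(features, spatial_map, strength):
    """Noise path: add the same per-pixel map to every channel."""
    return [[[v + strength * m for v, m in zip(row, map_row)]
             for row, map_row in zip(channel, spatial_map)]
            for channel in features]

def toy_layer(features, style, spatial_map, strength=0.1):
    """Apply channel-wise style modulation, then spatial injection."""
    return inject_spatial(modulate(features, style), spatial_map, strength)

# Two channels of 2x2 features.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[0.5, 0.5], [0.5, 0.5]]]
style = [0.5, -1.0]                    # hypothetical appearance code
motion_map = [[0.0, 1.0], [1.0, 0.0]]  # hypothetical stand-in for a rendered motion map

out = toy_layer(features, style, motion_map, strength=0.1)
```

The style path is global (one scalar per channel, no spatial layout), while the noise path is spatial (one value per pixel). That difference is why a spatially structured signal fits the second path better than the first in this sketch.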

StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator

Style-Preserving Lip Sync via Audio-Aware Style Reference

FaceSwapNet: Landmark Guided Many-to-Many Face Reenactment

VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild

SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis

SyncTalklip: Highly Synchronized Lip-Readable Speaker Generation with Multi-Task Learning

Wavsyncswap: End-To-End Portrait-Customized Audio-Driven Talking Face Generation

SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

Audio-driven Talking Face Video Generation with Natural Head Pose

StyleLipSync: Style-based Personalized Lip-sync Video Generation

RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network

VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization

A Neural Lip-Sync Framework for Synthesizing Photorealistic Virtual News Anchors

Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement

StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles

Real-time Lip Synchronization Based on Hidden Markov Models

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

Meta Talk: Learning To Data-Efficiently Generate Audio-Driven Lip-Synchronized Talking Face With High Definition