ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu
2024-08-07
Abstract: Lip-syncing videos with given audio is the foundation for various applications including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-orientated models either require long-term videos for clip-specific training or retain visible artifacts. In this paper, we propose a unified and effective framework ReSyncer, that synchronizes generalized audio-visual facial information. The key design is revisiting and rewiring the Style-based generator to efficiently adopt 3D facial dynamics predicted by a principled style-injected Transformer. By simply re-configuring the information insertion mechanisms within the noise and style space, our framework fuses motion and appearance with unified training. Extensive experiments demonstrate that ReSyncer not only produces high-fidelity lip-synced videos according to audio, but also supports multiple appealing properties that are suitable for creating virtual presenters and performers, including fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping. Resources can be found at https://guanjz20.github.io/projects/ReSyncer.
Computer Vision and Pattern Recognition, Graphics, Multimedia
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the problem of generating high-fidelity lip-synced videos, with the creation of virtual presenters and performers as the target application. Specifically:

1. **High-Fidelity Lip-Sync**: Existing methods either require long videos for clip-specific training or leave visible artifacts in the generated results. The paper proposes a unified framework, `ReSyncer`, which rewires a style-based generator to adopt 3D facial dynamics predicted by a style-injected Transformer, so that motion and appearance information are fused efficiently.
2. **Multiple Application Scenarios**: Beyond audio-driven lip-sync, `ReSyncer` supports fast personalized fine-tuning, video-driven lip-sync, speaking-style transfer, and even face swapping. These features are useful for creating virtual presenters and performers.
3. **Unified Model**: The framework handles lip-sync, speaking-style transfer, and face swapping within a single model, making the overall system more flexible and efficient.

Through these changes, `ReSyncer` both improves the quality of lip-synced videos and broadens their use in the creation of virtual characters.