RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network

Xiaozhong Ji, Chuming Lin, Zhonggan Ding, Ying Tai, Junwei Zhu, Xiaobin Hu, Donghao Luo, Yanhao Ge, Chengjie Wang
2024-08-08
Abstract: Person-generic audio-driven face generation is a challenging task in computer vision. Previous methods have achieved remarkable progress in audio-visual synchronization, but there is still a significant gap between current results and practical applications. The challenges are two-fold: 1) Preserving unique individual traits for achieving high-precision lip synchronization. 2) Generating high-quality facial renderings in real-time performance. In this paper, we propose a novel generalized audio-driven framework RealTalk, which consists of an audio-to-expression transformer and a high-fidelity expression-to-face renderer. In the first component, we consider both identity and intra-personal variation features related to speaking lip movements. By incorporating cross-modal attention on the enriched facial priors, we can effectively align lip movements with audio, thus attaining greater precision in expression prediction. In the second component, we design a lightweight facial identity alignment (FIA) module which includes a lip-shape control structure and a face texture reference structure. This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules. Our experimental results, both quantitative and qualitative, on public datasets demonstrate the clear advantages of our method in terms of lip-speech synchronization and generation quality. Furthermore, our method is efficient and requires fewer computational resources, making it well-suited to meet the needs of practical applications.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem the Paper Aims to Solve

This paper addresses the generation of high-fidelity talking face videos in real time. Specifically, it tackles the following challenges:

1. **Lip-Sync**: Ensuring that the lip movements in the generated face video are precisely synchronized with the input audio.
2. **High-Quality Facial Rendering**: Generating high-resolution, realistic facial images while preserving facial details and textures.
3. **Identity Consistency**: Keeping the generated expressions and facial features consistent with the original individual.
4. **Efficiency**: Increasing generation speed enough to enable real-time use in practical scenarios.

Existing methods each address only part of this list. For example, real-time methods such as Wav2Lip and TalkLip are fast but fall short on visual quality; non-real-time methods such as IP-LAP generate high-quality results but are computationally expensive, which makes them hard to use in practical applications.

To address these issues, the paper proposes a new framework called RealTalk, which has two main components:

1. **Audio-to-Expression Transformer**: Converts input audio into 3D expression coefficients, applying cross-modal attention over enriched facial priors (identity and intra-personal variation features) to align lip movements with the audio.
2. **Expression-to-Face Renderer**: Generates high-fidelity talking face videos from the estimated 3D expressions, employing a lightweight Facial Identity Alignment (FIA) module that combines a lip-shape control structure with a face texture reference structure.

Through these designs, RealTalk balances real-time performance with high-fidelity generation. Experimental results on public datasets show clear advantages in lip-sync and generation quality.
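To make the cross-modal attention idea concrete, here is a minimal sketch of scaled dot-product cross-attention in plain Python, in which each audio frame queries a set of facial prior tokens. This is not the paper's implementation: the function names `softmax` and `cross_attention`, the feature shapes, and the absence of learned projection matrices are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(audio_feats, prior_feats):
    """Hypothetical cross-attention: audio frames attend to facial prior tokens.

    audio_feats: T x d nested list (queries, one row per audio frame).
    prior_feats: N x d nested list (used as both keys and values).
    Returns (outputs, weights): a T x d list and a T x N list.
    Assumption: no learned Q/K/V projections; a real model would add them.
    """
    d = len(audio_feats[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in audio_feats:
        # Similarity between this audio frame and every prior token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in prior_feats]
        w = softmax(scores)
        # Weighted sum of prior tokens gives the attended feature for this frame.
        out = [sum(w[j] * prior_feats[j][c] for j in range(len(prior_feats)))
               for c in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights
```

Calling `cross_attention(audio, priors)` with a `T x d` list of audio features and an `N x d` list of prior features returns one attended feature per audio frame, plus the attention weights. In a full model, a further layer (not shown here) would map each attended feature to expression coefficients.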