RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network

Xiaozhong Ji, Chuming Lin, Zhonggan Ding, Ying Tai, Junwei Zhu, Xiaobin Hu, Donghao Luo, Yanhao Ge, Chengjie Wang

2024-08-08

Abstract: Person-generic audio-driven face generation is a challenging task in computer vision. Previous methods have achieved remarkable progress in audio-visual synchronization, but there is still a significant gap between current results and practical applications. The challenges are two-fold: 1) Preserving unique individual traits for achieving high-precision lip synchronization. 2) Generating high-quality facial renderings in real-time performance. In this paper, we propose a novel generalized audio-driven framework RealTalk, which consists of an audio-to-expression transformer and a high-fidelity expression-to-face renderer. In the first component, we consider both identity and intra-personal variation features related to speaking lip movements. By incorporating cross-modal attention on the enriched facial priors, we can effectively align lip movements with audio, thus attaining greater precision in expression prediction. In the second component, we design a lightweight facial identity alignment (FIA) module which includes a lip-shape control structure and a face texture reference structure. This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules. Our experimental results, both quantitative and qualitative, on public datasets demonstrate the clear advantages of our method in terms of lip-speech synchronization and generation quality. Furthermore, our method is efficient and requires fewer computational resources, making it well-suited to meet the needs of practical applications.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### The Problem the Paper Aims to Solve

This paper addresses the problem of generating high-fidelity talking face videos in real time. Specifically, it tackles the following challenges:

1. **Lip-Sync**: Ensuring that the lip movements in the generated face video are precisely synchronized with the input audio.
2. **High-Quality Facial Rendering**: Generating high-resolution, realistic facial images while preserving facial details and textures.
3. **Identity Consistency**: Keeping the generated expressions and facial features consistent with the original individual.
4. **Efficiency**: Increasing generation speed enough to enable real-time use in practical scenarios.

Existing methods have made progress on some of these aspects but still fall short on others. For example, real-time methods such as Wav2Lip and TalkLip are fast but fall short in visual quality, while non-real-time methods such as IP-LAP generate high-quality results but are too computationally expensive for practical applications.

To address these issues, the paper proposes a new framework called RealTalk, which includes two main components:

1. **Audio-to-Expression Transformer**: Converts input audio into 3D expression coefficients. It applies cross-modal attention over enriched facial priors (identity and intra-personal variation features) to align lip movements with the audio.
2. **Expression-to-Face Renderer**: Generates high-fidelity talking face videos from the estimated 3D expressions, using a lightweight Facial Identity Alignment (FIA) module that combines a lip-shape control structure with a face texture reference structure.

Through these designs, RealTalk balances real-time performance with high-fidelity generation. The paper reports that experiments on public datasets show clear advantages in lip-sync accuracy and generation quality.
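The cross-modal attention idea in the first component can be illustrated with a minimal sketch. This is not the paper's implementation: it is plain single-head scaled dot-product attention with no learned projections, and every name in it (`cross_attention`, `audio_feats`, `prior_tokens`, `prior_values`) as well as the toy dimensions are assumptions chosen for illustration. It only shows the general mechanism by which each audio frame (query) could weight a set of facial prior tokens (keys and values).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (illustrative only).

    queries: one feature vector per audio frame (hypothetical layout)
    keys:    one feature vector per facial prior token
    values:  one vector per facial prior token, mixed by the attention weights
    Returns (outputs, weights), one row per query.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        # Similarity between this audio frame and every prior token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy data (assumed, not from the paper): 3 audio frames, 2 prior tokens.
audio_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prior_tokens = [[1.0, 0.0], [0.0, 1.0]]
prior_values = [[0.5, 0.1, 0.0], [0.0, 0.2, 0.9]]

attended, attn = cross_attention(audio_feats, prior_tokens, prior_values)
```

In this toy setup, the first audio frame matches the first prior token more closely and therefore weights it more heavily, while the third frame is equally similar to both tokens and averages their values. A real model would add learned query, key, and value projections, multiple heads, and temporal context.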

APB2FaceV2: Real-Time Audio-Guided Multi-Face Reenactment

Audio-driven Talking Face Video Generation with Natural Head Pose

Meta Talk: Learning To Data-Efficiently Generate Audio-Driven Lip-Synchronized Talking Face With High Definition

GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation

FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio

High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model

Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

Spatially and Temporally Optimized Audio‐Driven Talking Face Generation

GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis

G4G: A Generic Framework for High Fidelity Talking Face Generation with Fine-grained Intra-modal Alignment

Photorealistic Audio-driven Video Portraits

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

Audio-Driven 3D Facial Animation from In-the-Wild Videos

Controllable Talking Face Generation by Implicit Facial Keypoints Editing

Talk3D: High-Fidelity Talking Portrait Synthesis via Personalized 3D Generative Prior

Realistic Speech-to-Face Generation with Speech-Conditioned Latent Diffusion Model with Face Prior

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation

Audio-driven talking face generation with diverse yet realistic facial animations