Abstract: Audio-driven talking head generation has drawn much attention in recent years, and many efforts have been made in lip-sync, expressive facial expressions, natural head pose generation, and high video quality. However, no model has yet led or tied on all these metrics due to the one-to-many mapping between audio and motion. In this paper, we propose VividTalk, a two-stage generic framework that supports generating high-visual-quality talking head videos with all the above properties. Specifically, in the first stage, we map the audio to mesh by learning two motions: non-rigid expression motion and rigid head motion. For expression motion, both blendshape and vertex are adopted as the intermediate representation to maximize the representation ability of the model. For natural head motion, a novel learnable head pose codebook with a two-phase training mechanism is proposed. In the second stage, we propose a dual-branch motion-VAE and a generator to transform the meshes into dense motion and synthesize high-quality video frame by frame. Extensive experiments show that the proposed VividTalk can generate high-visual-quality talking head videos with lip-sync and realism improved by a large margin, and that it outperforms previous state-of-the-art works in objective and subjective comparisons.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is: **how to generate high-quality audio-driven talking head videos while ensuring lip synchronization, rich facial expressions, and natural head poses**. Existing methods vary in how well they handle lip synchronization, rich facial expressions, natural head pose generation, and video quality, and no single model achieves the best results on all of these metrics. This is mainly due to the one-to-many mapping between audio and motion, which makes it difficult to generate facial and head motion accurately from audio.

### Main contributions of the paper

1. **Fused blendshape and vertex offset as intermediate representations**: To capture both the coarse motion of facial expressions and the fine lip motion, the paper uses two representations, blendshape and vertex offset.
2. **Proposed a learnable head pose codebook**: Because head poses are hard to learn directly from audio, the paper introduces a learnable head pose codebook with a two-phase training mechanism to generate more reasonable and continuous head poses.
3. **Designed a two-branch motion VAE**: In the second stage, the paper proposes a two-branch motion VAE to model dense motion and finally synthesize high-quality video frames.

### Method overview

The VividTalk framework is divided into two main stages:

- **Audio-To-Mesh Generation**: Map audio to 3D meshes, learning both non-rigid facial expression motion and rigid head motion.
- **Mesh-To-Video Generation**: Convert the generated 3D meshes into dense motion and synthesize high-quality video frame by frame.

#### Audio-To-Mesh Generation

In this stage, the paper uses a multi-branch BlendShape and Vertex Offset Generator and a Learnable Head Pose Codebook to handle facial expressions and head poses respectively.

- **BlendShape and Vertex Offset Generator**: A multi-branch Transformer network learns the coarse motion of facial expressions (blendshape) and the fine lip motion (vertex offset) from audio. The formulas are as follows:

  \[ \hat{\beta}_f^i = \Phi_{bs}^i(\hat{\beta}_{1...f - 1}^i, A, z_{style}), \quad i \in \{\text{lip}, \text{other}\} \]

  \[ \hat{O}_f^{\text{lip}} = \Phi_{vo}^{\text{lip}}(\hat{O}_{1...f - 1}^{\text{lip}}, A, z_{style}) \]

- **Learnable Head Pose Codebook**: A two-phase training mechanism learns a discrete head pose codebook used to generate natural and continuous head poses. The formulas are as follows:

  \[ Z_q = q(\hat{z}) = \arg \min_{z_k \in Z} \| \hat{z} - z_k \| \]

  \[ \hat{P}_{1:f}^r = D(Z_q) = D(q(E(P_{1:f}^r))) \]

#### Mesh-To-Video Generation

In this stage, the paper designs a two-branch motion VAE to convert motion in the 3D domain into dense motion in the 2D domain and finally synthesize high-quality video frames.

### Experimental results

The paper verifies the effectiveness of VividTalk through multiple quantitative and qualitative experiments. The results show that VividTalk outperforms existing methods in lip synchronization, identity preservation, and head pose diversity. User studies also indicate that VividTalk scores highest in overall quality, lip synchronization, motion naturalness, and identity preservation.

In conclusion, this paper, through innovative methods...
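The codebook lookup \(q(\hat{z})\) in the head pose formulas above is a nearest-neighbour search over the codebook entries \(z_k\). The following is a minimal sketch of that one step, not the paper's implementation: the function name `quantize`, the 2-dimensional latents, and the toy codebook values are assumptions made for illustration, and the encoder \(E\) and decoder \(D\) are omitted.

```python
import math
from typing import List, Sequence, Tuple


def quantize(
    z_hat: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Replace each continuous latent with its nearest codebook entry.

    z_hat:    continuous latents, one vector per time step (hypothetical shape)
    codebook: the discrete entries z_k
    Returns the chosen indices and the corresponding quantized vectors.
    """
    indices: List[int] = []
    for z in z_hat:
        # Euclidean distance from this latent to every codebook entry.
        dists = [math.dist(z, z_k) for z_k in codebook]
        # arg min over k; ties resolve to the lowest index.
        indices.append(dists.index(min(dists)))
    quantized = [list(codebook[i]) for i in indices]
    return indices, quantized


# Toy values, chosen only to illustrate the lookup.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
z_hat = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]

indices, z_q = quantize(z_hat, codebook)
print(indices)  # [0, 1, 2]
print(z_q)      # [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
```

In a trained model the quantized vectors would then be passed to the decoder \(D\) to reconstruct the pose sequence. Because the output can only be built from learned codebook entries, predictions are restricted to a finite set of learned pose patterns.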

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

Audio-driven Talking Face Video Generation with Natural Head Pose

Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

Towards Realistic Conversational Head Generation: A Comprehensive Framework for Lifelike Video Synthesis

High-Fidelity and Freely Controllable Talking Head Video Generation

MakeItTalk: Speaker-Aware Talking-Head Animation

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

Audio-Driven Emotional 3D Talking-Head Generation

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation

VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

Breathing Life into Faces: Speech-driven 3D Facial Animation with Natural Head Pose and Detailed Shape

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

Talking-head Generation with Rhythmic Head Motion

AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person

StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation

Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts

Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation

PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation

Meta Talk: Learning To Data-Efficiently Generate Audio-Driven Lip-Synchronized Talking Face With High Definition