VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

Xusen Sun, Longhao Zhang, Hao Zhu, Peng Zhang, Bang Zhang, Xinya Ji, Kangneng Zhou, Daiheng Gao, Liefeng Bo, Xun Cao
2023-12-07
Abstract: Audio-driven talking head generation has drawn much attention in recent years, and many efforts have been made in lip-sync, expressive facial expressions, natural head pose generation, and high video quality. However, no model has yet led or tied on all these metrics due to the one-to-many mapping between audio and motion. In this paper, we propose VividTalk, a two-stage generic framework that supports generating high-visual-quality talking head videos with all the above properties. Specifically, in the first stage, we map the audio to mesh by learning two motions: non-rigid expression motion and rigid head motion. For expression motion, both blendshape and vertex are adopted as the intermediate representation to maximize the representation ability of the model. For natural head motion, a novel learnable head pose codebook with a two-phase training mechanism is proposed. In the second stage, we propose a dual-branch motion-VAE and a generator to transform the meshes into dense motion and synthesize high-quality video frame by frame. Extensive experiments show that the proposed VividTalk can generate high-visual-quality talking head videos with lip-sync and realism enhanced by a large margin, and that it outperforms previous state-of-the-art works in objective and subjective comparisons.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is: **how to generate high-quality audio-driven talking head videos while ensuring accurate lip synchronization, rich facial expressions, and natural head poses**. Existing methods differ in how well they handle lip synchronization, rich facial expressions, natural head pose generation, and video quality, and no single model achieves the best result on all of these metrics. This is mainly due to the one-to-many mapping between audio and motion, which makes it difficult to generate facial and head motion accurately from audio.

### Main contributions of the paper

1. **Fused blendshape and vertex offset as intermediate representations**: To better capture both the coarse motion of facial expressions and the fine lip motion, the paper uses two representations, blendshape and vertex offset.
2. **Proposed a learnable head pose codebook**: To address the difficulty of learning head poses directly from audio, the paper introduces a learnable head pose codebook with a two-phase training mechanism that generates more reasonable and continuous head poses.
3. **Designed a two-branch motion VAE**: In the second stage, the paper proposes a two-branch motion VAE to model dense motion, from which high-quality video frames are synthesized.

### Method overview

The VividTalk framework is divided into two main stages:

- **Audio-To-Mesh Generation**: Map audio to 3D meshes by learning non-rigid facial expression motion and rigid head motion.
- **Mesh-To-Video Generation**: Convert the generated 3D meshes into dense motion and synthesize high-quality video frame by frame.

#### Audio-To-Mesh Generation

In this stage, the paper uses a multi-branch BlendShape and Vertex Offset Generator and a Learnable Head Pose Codebook to handle facial expressions and head poses, respectively.
- **BlendShape and Vertex Offset Generator**: A multi-branch Transformer network learns the coarse motion of facial expressions (blendshape) and the fine lip motion (vertex offset) from audio. The formulas are as follows:

  \[ \hat{\beta}_f^i = \Phi_{bs}^i(\hat{\beta}_{1 \dots f-1}^i, A, z_{\text{style}}), \quad i \in \{\text{lip}, \text{other}\} \]

  \[ \hat{O}_f^{\text{lip}} = \Phi_{vo}^{\text{lip}}(\hat{O}_{1 \dots f-1}^{\text{lip}}, A, z_{\text{style}}) \]

- **Learnable Head Pose Codebook**: A two-phase training mechanism learns a discrete head pose codebook that is used to generate natural and continuous head poses. The formulas are as follows:

  \[ Z_q = q(\hat{z}) = \arg \min_{z_k \in Z} \| \hat{z} - z_k \| \]

  \[ \hat{P}_{1:f}^r = D(Z_q) = D(q(E(P_{1:f}^r))) \]

#### Mesh-To-Video Generation

In this stage, the paper designs a two-branch motion VAE that converts motion in the 3D domain into dense motion in the 2D domain, from which high-quality video frames are synthesized.

### Experimental results

The paper evaluates VividTalk through multiple quantitative and qualitative experiments. The results show that VividTalk outperforms existing methods in lip synchronization, identity preservation, and head pose diversity. In addition, user studies indicate that VividTalk scores highest in overall quality, lip synchronization, motion naturalness, and identity preservation.

In conclusion, this paper, through innovative methods...
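
As an illustration of the codebook lookup \( Z_q = q(\hat{z}) = \arg \min_{z_k \in Z} \| \hat{z} - z_k \| \) from the Learnable Head Pose Codebook step, the following is a minimal sketch of nearest-entry quantization in plain Python. It is not the authors' implementation: the function names `quantize` and `quantize_sequence`, the 2-D toy codebook, and the toy latents are all assumptions made for illustration, and the encoder \(E\) and decoder \(D\) are not modeled.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def quantize(z_hat: Vector, codebook: Sequence[Vector]) -> Tuple[int, Vector]:
    """Return (index, entry) of the codebook entry closest to z_hat.

    Implements q(z_hat) = argmin_{z_k in Z} ||z_hat - z_k|| with the
    Euclidean norm. Ties resolve to the lowest index.
    """
    if not codebook:
        raise ValueError("codebook must not be empty")
    best_index = min(
        range(len(codebook)),
        key=lambda k: math.dist(z_hat, codebook[k]),
    )
    return best_index, codebook[best_index]


def quantize_sequence(
    latents: Sequence[Vector], codebook: Sequence[Vector]
) -> List[Vector]:
    """Quantize each latent independently, giving the sequence Z_q."""
    return [quantize(z, codebook)[1] for z in latents]


# Hypothetical toy values: a 3-entry codebook and three latents that an
# encoder E might produce. In the paper the codebook entries are learned.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
latents = [(0.9, 0.1), (0.1, 0.2), (0.2, 0.8)]

indices = [quantize(z, codebook)[0] for z in latents]
z_q = quantize_sequence(latents, codebook)
print(indices)  # [1, 0, 2]
print(z_q)      # [(1.0, 0.0), (0.0, 0.0), (0.0, 1.0)]
```

In a full pipeline the quantized sequence would be passed to a decoder to reconstruct head poses, as in \( \hat{P}_{1:f}^r = D(Z_q) \). The sketch covers only the discrete lookup, which is the part the formula specifies completely.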