Abstract: Talking face generation is the demanding task of synthesizing high-quality video with accurate lip synchronization and rhythmic head motion. However, existing methods often produce unrealistic facial animations, because 1) they take only single-modal input and ignore the complementarity of multimodal inputs for improving lip-sync; 2) they model only lip movements and ignore the articulator synergy between lips and jaw; 3) they generate each video frame independently and ignore temporal continuity across the video. To address these limitations, we present a novel method that generates realistic and temporally coherent talking heads by considering multimodal inputs, articulator synergy, inter-frame consistency, and intra-frame consistency. First, for landmark prediction, we propose a Multiple Synergy Network (MSN) that improves prediction accuracy by incorporating multimodal inputs (i.e., audio and text). In addition, rather than considering lip landmarks alone, we also model jaw movements to ensure articulator synergy between lips and jaw. Second, for realistic video generation, we propose a Video Consistency Network (VCN) conditioned on the predicted landmarks. In VCN, optical flow is used to model the temporal continuity between frames and ensure inter-frame consistency. Meanwhile, a mouth generation branch enhances mouth texture, and the corresponding mouth mask ensures intra-frame consistency between the mouth area and the other regions. Extensive experiments demonstrate that our approach achieves superior lip-sync and generates photo-realistic facial animations. The project is available at http://imcc.ustc.edu.cn/project/tfgen/.
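
The role of the mouth mask can be illustrated with a minimal sketch of mask-based compositing, in which a separately generated mouth region is blended into the full face frame. This is an assumption about how such a mask is commonly applied, not the VCN implementation described in the paper; the function name `composite_mouth`, the array shapes, and the toy data are all hypothetical.

```python
import numpy as np


def composite_mouth(face, mouth, mask):
    """Blend a generated mouth region into a full face frame.

    face, mouth: float arrays of shape (H, W, 3) with values in [0, 1].
    mask: float array of shape (H, W) in [0, 1]; 1 marks the mouth area.
    (Hypothetical helper for illustration only.)
    """
    if face.shape != mouth.shape or face.shape[:2] != mask.shape:
        raise ValueError("face, mouth and mask must share spatial size")
    m = mask[..., None]  # add a channel axis so the mask broadcasts over RGB
    # Inside the mask take the mouth branch output, elsewhere keep the face.
    return m * mouth + (1.0 - m) * face


# Toy 4x4 frame: grey face, red mouth patch in the lower centre.
face = np.full((4, 4, 3), 0.5)
mouth = np.zeros((4, 4, 3))
mouth[..., 0] = 1.0
mask = np.zeros((4, 4))
mask[2:4, 1:3] = 1.0
frame = composite_mouth(face, mouth, mask)
```

A soft mask with values strictly between 0 and 1 near the boundary would blend the two sources there and avoid a visible seam between the mouth area and the surrounding face.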

Realistic Mouth Animation Synthesis Based on Articulatory DBN Models

Realistic Mouth Animation Based on an Articulatory DBN Model with Constrained Asynchrony

Video Realistic Mouth Animation Based on an Audio Visual DBN Model with Articulatory Features and Constrained Asynchrony

Speech driven photo realistic facial animation based on an articulatory DBN model and AAM features

Speech Driven Facial Animation Synthesis Based on State Asynchronous DBN

APB2FACE: Audio-Guided Face Reenactment with Auxiliary Pose and Blink Signals

Speech driven photo-realistic face animation with mouth and jaw dynamics

GA-Based Speaking Mouth Correlative Speech Feature Abstraction

Lip Movement Generation Using Restricted Boltzmann Machines For Visual Speech Synthesis

Audio-driven Talking Face Video Generation with Natural Head Pose

Speech Driven Realistic Mouth Animation Based on Multi-Modal Unit Selection

A Speech-Driven 3-D Lip Synthesis with Realistic Dynamics in Mandarin Chinese

Audio to Deep Visual: Speaking Mouth Generation Based on 3D Sparse Landmarks

Audio Visual Speech Recognition Based on Multi-Stream DBN Models with Articulatory Features

Audio-Semantic Enhanced Pose-Driven Talking Head Generation

Dimensional Emotion Driven Facial Expression Synthesis Based on the Multi-Stream DBN Model

3D Facial Animation from Chinese Text

A Realistic 3d Articulatory Animation System for Emotional Visual Pronunciation

Multimodal Learning for Temporally Coherent Talking Face Generation with Articulator Synergy

LawDNet: Enhanced Audio-Driven Lip Synthesis via Local Affine Warping Deformation

A Synthesis Method of Three-dimensional Facial Animation with Multiple Elements Blending Based on MPEG-4