Abstract: Real-world talking faces are often accompanied by natural head movement. However, most existing talking face video generation methods only consider facial animation with a fixed head pose. In this paper, we address this problem by proposing a deep neural network model that takes an audio signal A of a source person and a very short video V of a target person as input, and outputs a synthesized high-quality talking face video with personalized head pose (making use of the visual information in V), expression and lip synchronization (by considering both A and V). The most challenging issue in our work is that natural poses often cause in-plane and out-of-plane head rotations, which make the synthesized talking face video far from realistic. To address this challenge, we reconstruct 3D face animation and re-render it into synthesized frames. To fine-tune these frames into realistic ones with smooth background transitions, we propose a novel memory-augmented GAN module. By first training a general mapping on a publicly available dataset and then fine-tuning the mapping using the input short video of the target person, we develop an effective strategy that requires only a small number of frames (about 300) to learn personalized talking behavior, including head pose. Extensive experiments and two user studies show that our method can generate high-quality talking face videos (i.e., with personalized head movements, expressions and good lip synchronization), which look natural and exhibit more distinctive head movement effects than those of state-of-the-art methods.

Speech Driven Face Animation based on Dynamic Concatenation Model

Dynamic mapping method based speech driven face animation system

Realistic Visual Speech Synthesis Based on Hybrid Concatenation Method

Text-driven Visual Prosody Generation for Embodied Conversational Agents

Expressive Face Animation Synthesis Based on Dynamic Mapping Method

Audio-driven Talking Face Video Generation with Natural Head Pose

Real-time Speech-Driven Animation of Expressive Talking Faces

Stylized Synthesis of Facial Speech Motions

3D Realistic Talking Face Co-Driven by Text and Speech

Speech Driven Facial Animation Using Chinese Mandarin Pronunciation Rules

Speech-Driven 3D Face Animation with Composite and Regional Facial Movements

A Speech-Driven 3-D Lip Synthesis with Realistic Dynamics in Mandarin Chinese

Speech-driven Facial Animation with Spectral Gathering and Temporal Attention

Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation

Face Animation Based on Large Audiovisual Database

A Novel Speech-Driven Lip-Sync Model with CNN and LSTM

Data Mining and Speech Driven Face Animation

Real-time speech-driven lip synchronization

Breathing Life into Faces: Speech-driven 3D Facial Animation with Natural Head Pose and Detailed Shape

An Expressive TTVS System Based on Dynamic Unit Selection

Text-To-Visual Speech in Chinese Based on Data-Driven Approach