Abstract: We present a real-time facial tracking and animation system based on a Kinect sensor with video and audio input. Our method requires no user-specific training and is robust to occlusions, large head rotations, and background noise. Given the color, depth, and speech audio frames captured from an actor, our system first reconstructs 3D facial expressions and 3D mouth shapes from the color and depth input with a multi-linear model. Concurrently, a speaker-independent DNN acoustic model extracts phoneme state posterior probabilities (PSPP) from the audio frames. A lip motion regressor then refines the 3D mouth shape based on the PSPP, the expression weights of the 3D mouth shapes, and the confidences of both. Finally, the refined 3D mouth shape is combined with the other parts of the 3D face to generate the final result. The whole process is fully automatic and runs in real time. The key component of our system is a data-driven regressor that models the correlation between speech data and mouth shapes. Trained on a precaptured database of accurate 3D mouth shapes and associated speech audio from one speaker, the regressor jointly uses the input speech and visual features to refine the mouth shape of a new actor. We also present an improved DNN acoustic model that preserves accuracy while achieving real-time performance. Our method efficiently fuses visual and acoustic information for 3D facial performance capture. It generates more accurate 3D mouth motions than approaches based on audio or video input alone, and it also supports video-only or audio-only input for real-time facial animation. We evaluate our system on speech and facial expressions captured from different actors. The results demonstrate the efficiency and robustness of our method.
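
As a rough illustration of the refinement step described in the abstract, the sketch below fuses per-frame PSPP with visually tracked mouth expression weights, scaling each by its confidence, and maps the result to refined mouth weights. The abstract does not specify the form of the regressor, so a closed-form ridge regression is used here purely as a stand-in. All names (`build_features`, `fit_ridge`, `refine_mouth`), the feature layout, and the confidence scaling are assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def build_features(pspp, mouth_weights, audio_conf, visual_conf):
    """Build one per-frame feature vector (hypothetical layout).

    Each modality is scaled by its confidence, and the two confidences
    are appended so the regressor can also see them directly.
    """
    pspp = np.asarray(pspp, dtype=float)
    mouth_weights = np.asarray(mouth_weights, dtype=float)
    return np.concatenate(
        [audio_conf * pspp, visual_conf * mouth_weights, [audio_conf, visual_conf]]
    )


def fit_ridge(X, Y, lam=1e-3):
    """Closed-form ridge regression: W = (X^T X + lam * I)^-1 X^T Y.

    A stand-in for the data-driven regressor; X holds one feature vector
    per row, Y the corresponding ground-truth mouth weights.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)


def refine_mouth(W, features):
    """Map a fused feature vector to refined mouth expression weights."""
    return np.asarray(features, dtype=float) @ W


if __name__ == "__main__":
    # Synthetic stand-in for a precaptured database (assumed sizes).
    rng = np.random.default_rng(0)
    n_frames, n_states, n_weights = 100, 5, 3
    X = np.stack([
        build_features(
            rng.dirichlet(np.ones(n_states)),   # PSPP sums to 1
            rng.uniform(0.0, 1.0, n_weights),   # tracked mouth weights
            rng.uniform(0.2, 1.0),              # audio confidence
            rng.uniform(0.2, 1.0),              # visual confidence
        )
        for _ in range(n_frames)
    ])
    Y = rng.uniform(0.0, 1.0, (n_frames, n_weights))  # placeholder targets
    W = fit_ridge(X, Y)
    print(refine_mouth(W, X[0]).shape)
```

Scaling each modality by its confidence means that an occluded mouth (low visual confidence) or noisy audio (low audio confidence) contributes less to the fused feature. This is one simple way to let a single regressor degrade toward audio-only or video-only input, under the assumptions stated above.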

Video Realistic Mouth Animation Based on an Audio Visual DBN Model with Articulatory Features and Constrained Asynchrony

Speech driven photo realistic facial animation based on an articulatory DBN model and AAM features

APB2FACE: Audio-Guided Face Reenactment with Auxiliary Pose and Blink Signals

Acoustic VR in the Mouth: A Real-Time Speech-Driven Visual Tongue System

Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models

Video-audio Driven Real-Time Facial Animation

Speaker-independent Lips and Tongue Visualization of Vowels

3D Facial Animation from Chinese Text

Progress in animation of an EMA-controlled tongue model for acoustic-visual speech synthesis

A Speech-Driven 3-D Lip Synthesis with Realistic Dynamics in Mandarin Chinese

Mining Audio/Visual Database For Speech Driven Face Animation

Audio-Driven 3D Facial Animation from In-the-Wild Videos

Audio-driven Talking Face Video Generation with Natural Head Pose

VividWav2Lip: High-Fidelity Facial Animation Generation Based on Speech-Driven Lip Synchronization

A Novel Speech to Mouth Articulation System for Realistic Humanoid Robots

LawDNet: Enhanced Audio-Driven Lip Synthesis via Local Affine Warping Deformation

VisemeNet: Audio-Driven Animator-Centric Speech Animation

Attention-Based VR Facial Animation with Visual Mouth Camera Guidance for Immersive Telepresence Avatars

A real-time speech driven talking avatar based on deep neural network

A Speech-Driven 3-D Tongue Model with Realistic Movement in Mandarin Chinese

Lip Movement Generation Using Restricted Boltzmann Machines For Visual Speech Synthesis