Abstract: We present a method that generates expressive talking heads from a single facial image with audio as the only input. In contrast to previous approaches that attempt to learn direct mappings from audio to raw pixels or points for creating talking faces, our method first disentangles the content and speaker information in the input audio signal. The audio content robustly controls the motion of lips and nearby facial regions, while the speaker information determines the specifics of facial expressions and the rest of the talking head dynamics. Another key component of our method is the prediction of facial landmarks reflecting speaker-aware dynamics. Based on this intermediate representation, our method is able to synthesize photorealistic videos of entire talking heads with full range of motion and also animate artistic paintings, sketches, 2D cartoon characters, Japanese mangas, stylized caricatures in a single unified framework. We present extensive quantitative and qualitative evaluation of our method, in addition to user studies, demonstrating generated talking heads of significantly higher quality compared to prior state-of-the-art.

What problem does this paper attempt to address?

The paper "MakeItTalk: Speaker-Aware Talking-Head Animation" addresses the problem of generating high-quality talking-head animations. Specifically, the authors propose a deep-learning-based method that generates expressive talking-head videos from a single facial image and an audio input. Unlike previous methods that map directly from audio to pixels, this method first disentangles the content information and the speaker information in the audio signal. The content information controls the movement of the lips and nearby facial regions, while the speaker information determines the specifics of facial expressions and the remaining head dynamics. This approach improves the realism and expressiveness of the generated animations and also produces plausible results on new faces and voices that were not seen during training.

The key contributions of the paper include:

- Introducing a new deep-learning architecture for predicting facial landmarks from voice signals. These landmarks capture not only facial expressions but also the overall head pose.
- Generating speaker-aware talking-head animations based on the disentangled voice content and speaker information, inspired by progress in the voice-conversion field.
- Proposing two landmark-based image synthesis methods, one for non-photorealistic cartoon images and one for natural human face images. These methods can handle new faces and cartoon characters that were not observed during training.
- Proposing a set of quantitative metrics and conducting user studies to evaluate talking-head animation methods.

Through these innovations, the paper aims to overcome the limitations of existing techniques in generating realistic and expressive talking-head animations, especially for unseen faces and voices.
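The split between content-driven and speaker-driven motion described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the use of 2D points, and the three-landmark face are all assumptions made for illustration, and the offsets are written by hand rather than predicted by a network. It only shows how lip-region offsets (content) and whole-head offsets (speaker) could be composed on top of the static landmarks of the input image.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def add_offsets(points: Sequence[Point], offsets: Sequence[Point]) -> List[Point]:
    """Add a per-landmark (dx, dy) offset to each landmark."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(points, offsets)]


def animate_frame(
    static_landmarks: Sequence[Point],
    content_offsets: Sequence[Point],
    speaker_offsets: Sequence[Point],
) -> List[Point]:
    """Compose one frame of landmarks (hypothetical helper).

    content_offsets: stand-in for the audio-content branch, which mainly
        moves the lips and nearby regions.
    speaker_offsets: stand-in for the speaker-aware branch, which adds
        expression and head-motion dynamics.
    """
    if not (len(static_landmarks) == len(content_offsets) == len(speaker_offsets)):
        raise ValueError("all inputs must have one entry per landmark")
    # Stage 1: content-driven motion applied to the static face.
    content_driven = add_offsets(static_landmarks, content_offsets)
    # Stage 2: speaker-aware dynamics applied on top of stage 1.
    return add_offsets(content_driven, speaker_offsets)


# Toy face with three landmarks: one eye, upper lip, lower lip.
static = [(0.0, 2.0), (0.0, 0.0), (0.0, -1.0)]
# The content branch opens the mouth and leaves the eye untouched.
content = [(0.0, 0.0), (0.0, 0.5), (0.0, -0.5)]
# The speaker branch shifts the whole head slightly to the right.
speaker = [(0.25, 0.0), (0.25, 0.0), (0.25, 0.0)]

frame = animate_frame(static, content, speaker)
print(frame)
```

In this sketch, changing only `speaker` alters the head motion while the mouth opening stays the same, which mirrors the intent of separating the two sources of information.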

MakeItTalk: Speaker-Aware Talking-Head Animation

Audio-driven Talking Face Video Generation with Natural Head Pose

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation

ManiTalk: manipulable talking head generation from single image in the wild

Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation

Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style

StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation

Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person

Controllable Talking Face Generation by Implicit Facial Keypoints Editing

Style Transfer for 2D Talking Head Animation

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding

StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads

Audio-Driven Emotional 3D Talking-Head Generation

StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles

Breathing Life into Faces: Speech-driven 3D Facial Animation with Natural Head Pose and Detailed Shape

OpFlowTalker: Realistic and Natural Talking Face Generation via Optical Flow Guidance