Abstract: To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Realistic facial animation requires lip-sync together with the natural expression of emotion. To that end, we propose EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion. To achieve this, we supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion. These losses are based on two key observations: (1) deformations of the face due to speech are spatially localized around the mouth and have high temporal frequency, whereas (2) facial expressions may deform the whole face and occur over longer intervals. Thus, we train EMOTE with a per-frame lip-reading loss to preserve the speech-dependent content, while supervising emotion at the sequence level. Furthermore, we employ a content-emotion exchange mechanism to supervise different emotions on the same audio while keeping the lip motion synchronized with the speech. To employ deep perceptual losses without introducing undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Because no high-quality aligned emotional 3D face datasets with speech exist, EMOTE is trained with 3D pseudo-ground-truth extracted from an emotional video dataset (i.e., MEAD). Extensive qualitative and perceptual evaluations demonstrate that EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data, while offering additional, high-quality emotional control.
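
The split between per-frame speech supervision and sequence-level emotion supervision can be illustrated with a minimal sketch. This is not the EMOTE implementation: the feature vectors stand in for the outputs of a lip-reading network and an emotion-recognition network, and the function names, the mean-squared-error distance, the temporal mean pooling, and the weights `w_lip` and `w_emo` are all assumptions chosen for illustration.

```python
from statistics import fmean


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return fmean((x - y) ** 2 for x, y in zip(a, b))


def per_frame_lip_loss(pred_lip, target_lip):
    """Compare each frame with its own target, then average over frames.

    Frame order matters here, which suits fast, localized mouth motion.
    """
    return fmean(mse(p, t) for p, t in zip(pred_lip, target_lip))


def sequence_emotion_loss(pred_emo, target_emo):
    """Compare temporally pooled features: one comparison per sequence.

    Mean pooling over time is an assumption of this sketch.
    """
    def pool(frames):
        # Average each feature dimension across all frames.
        return [fmean(column) for column in zip(*frames)]

    return mse(pool(pred_emo), pool(target_emo))


def decoupled_loss(pred_lip, target_lip, pred_emo, target_emo,
                   w_lip=1.0, w_emo=1.0):
    """Weighted sum of the two terms (weights are hypothetical)."""
    return (w_lip * per_frame_lip_loss(pred_lip, target_lip)
            + w_emo * sequence_emotion_loss(pred_emo, target_emo))


# Toy features: two frames, two dimensions, with the frames swapped.
pred = [[0.0, 1.0], [1.0, 0.0]]
target = [[1.0, 0.0], [0.0, 1.0]]

lip = per_frame_lip_loss(pred, target)
emo = sequence_emotion_loss(pred, target)
total = decoupled_loss(pred, target, pred, target, w_lip=1.0, w_emo=0.5)
```

In this toy case the per-frame term penalizes the swapped frames, while the sequence-level term is zero because both sequences have the same temporal average. That is the intended asymmetry: the speech term is sensitive to timing, the emotion term only to the overall expression of the sequence.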
