Abstract: With the continuous development of cross-modality generation, audio-driven talking face generation has advanced substantially in speech content and mouth shape, but research on emotion generation for talking faces remains comparatively immature. In this work, we present Emotionally Controllable Talking Face Generation from an Arbitrary Emotional Portrait, which synthesizes a high-quality talking face that is both lip-synced and emotionally controllable. Specifically, we take a facial reenactment perspective and use facial landmarks as an intermediate representation, so that the landmark features of an arbitrary emotional portrait drive the expression of the generated talking face. To improve emotion control, we adopt a decoupled design that divides the model into three sub-networks: the lip-sync landmark animation generation network, the emotional landmark animation generation network, and the landmark-to-animation translation network. The two landmark animation generation networks generate content-related lip-area landmarks and facial expression landmarks, respectively, which are used to correct the landmark sequences of the target portrait. The corrected landmark sequences and the target portrait are then fed into the translation network to generate an emotionally controllable talking face. Our method controls the expressions of talking faces through the driving emotional portrait images while keeping the animation lip-synced, and it can handle audio and portraits not seen during training. A multi-perspective user study and extensive quantitative and qualitative evaluations demonstrate the superiority of the system in terms of visual emotion representation and video authenticity.
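The decoupled data flow described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the 68-point landmark layout with mouth indices 48 to 67, the function names, the `strength` parameter, and the placeholder arithmetic standing in for the two learned landmark networks. The translation network is omitted; the sketch only shows how mouth landmarks and remaining facial landmarks could be produced separately and merged into a corrected sequence.

```python
import numpy as np

NUM_POINTS = 68                     # assumption: 68-point landmark layout
MOUTH = np.arange(48, 68)           # assumption: mouth indices in that layout
REST = np.setdiff1d(np.arange(NUM_POINTS), MOUTH)


def lip_sync_net(audio_feats, target_lm):
    """Hypothetical stand-in for the lip-sync landmark network.

    Returns per-frame mouth landmarks of shape (T, 20, 2). The placeholder
    stretches the mouth vertically in proportion to mean audio magnitude.
    """
    T = audio_feats.shape[0]
    energy = np.abs(audio_feats).mean(axis=1)               # (T,)
    mouth = np.repeat(target_lm[MOUTH][None], T, axis=0)    # (T, 20, 2), a copy
    centre_y = mouth[..., 1].mean(axis=1, keepdims=True)    # (T, 1)
    mouth[..., 1] = centre_y + (mouth[..., 1] - centre_y) * (1.0 + energy[:, None])
    return mouth


def emotion_net(emotion_lm, target_lm, T, strength=1.0):
    """Hypothetical stand-in for the emotional landmark network.

    Returns per-frame non-mouth landmarks of shape (T, 48, 2). The placeholder
    moves the target's points toward those of the emotional portrait.
    """
    rest = target_lm[REST] + strength * (emotion_lm[REST] - target_lm[REST])
    return np.repeat(rest[None], T, axis=0)


def corrected_sequence(audio_feats, target_lm, emotion_lm, strength=1.0):
    """Merge both landmark streams into one corrected sequence (T, 68, 2)."""
    audio_feats = np.asarray(audio_feats, dtype=float)
    target_lm = np.asarray(target_lm, dtype=float)
    emotion_lm = np.asarray(emotion_lm, dtype=float)
    T = audio_feats.shape[0]
    seq = np.repeat(target_lm[None], T, axis=0)
    seq[:, MOUTH] = lip_sync_net(audio_feats, target_lm)
    seq[:, REST] = emotion_net(emotion_lm, target_lm, T, strength)
    # In the described pipeline, this sequence and the target portrait would
    # then be passed to the landmark-to-animation translation network.
    return seq
```

Keeping the two landmark streams in separate functions mirrors the decoupling idea: the mouth region depends only on audio, and the rest of the face depends only on the emotional portrait, so either input can be swapped without touching the other.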

Audio-Driven Emotional 3D Talking-Head Generation

Audio-driven Talking Face Video Generation with Natural Head Pose

Audio-Driven Emotional Video Portraits

AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis

3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head

Emotional Semantic Neural Radiance Fields for Audio-Driven Talking Head

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks

Emotionally Controllable Talking Face Generation from an Arbitrary Emotional Portrait

EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

Talking Face Generation With Audio-Deduced Emotional Landmarks

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation

Audio-Semantic Enhanced Pose-Driven Talking Head Generation

Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

EAT-Face: Emotion-Controllable Audio-Driven Talking Face Generation Via Diffusion Model