Abstract: With the continued development of cross-modality generation, audio-driven talking face generation has made substantial advances in speech content and mouth shape, but research on emotion generation for talking faces remains comparatively immature. In this work, we present Emotionally Controllable Talking Face Generation from an Arbitrary Emotional Portrait, which synthesizes a lip-synced, emotionally controllable, high-quality talking face. Specifically, we take a facial reenactment perspective and use facial landmarks as an intermediate representation, so that the landmark features of an arbitrary emotional portrait drive the expression of the generated talking face. To improve emotion control, we adopt a decoupled design that divides the model into three sub-networks: the lip-sync landmark animation generation network, the emotional landmark animation generation network, and the landmark-to-animation translation network. The two landmark animation generation networks generate content-related lip-area landmarks and facial expression landmarks, respectively, which are used to correct the landmark sequences of the target portrait. The corrected landmark sequences and the target portrait are then fed into the translation network to generate an emotionally controllable talking face. Our method controls the expression of the talking face through an emotional portrait image while preserving lip synchronization, and it can handle audio and portraits not seen during training. A multi-perspective user study and extensive quantitative and qualitative evaluations demonstrate the superiority of the system in terms of visual emotion representation and video authenticity.
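The landmark-correction step described in the abstract can be illustrated with a minimal data-flow sketch. This is not the authors' implementation. It assumes a 68-point landmark layout with indices 48 to 67 covering the mouth, assumes that both generation networks output per-point offsets, and uses hypothetical names (`correct_landmarks`, `generate_sequence`). The learned networks themselves are not modelled, only how their outputs might be merged before the translation network.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Landmarks = List[Point]  # one frame of 2D facial landmarks

# Assumption: 68-point layout in which indices 48..67 cover the mouth.
NUM_POINTS = 68
LIP_IDX = range(48, 68)


def correct_landmarks(target: Landmarks,
                      lip_offsets: Landmarks,
                      emotion_offsets: Landmarks) -> Landmarks:
    """Shift mouth points by the lip offsets and all other points by the emotion offsets."""
    assert len(target) == len(lip_offsets) == len(emotion_offsets) == NUM_POINTS
    corrected = []
    for i, (x, y) in enumerate(target):
        # Decoupling: the lip branch owns the mouth region, the emotion branch owns the rest.
        dx, dy = lip_offsets[i] if i in LIP_IDX else emotion_offsets[i]
        corrected.append((x + dx, y + dy))
    return corrected


def generate_sequence(target: Landmarks,
                      lip_seq: List[Landmarks],
                      emotion_seq: List[Landmarks]) -> List[Landmarks]:
    """Correct the target landmarks once per frame; a translation network would render the result."""
    return [correct_landmarks(target, lip, emo)
            for lip, emo in zip(lip_seq, emotion_seq)]
```

In this sketch the two branches never write to the same landmark, which is one simple way to keep lip motion and expression independently controllable. The actual method may merge the two predictions differently.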

Talking Face Video Generation with Editable Expression

Audio-driven Talking Face Video Generation with Natural Head Pose

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Dynamic Neural Textures: Generating Talking-Face Videos with Continuously Controllable Expressions

Talking Faces: Audio-to-Video Face Generation

Continuously Controllable Facial Expression Editing in Talking Face Videos

Controllable Talking Face Generation by Implicit Facial Keypoints Editing

CP-EB: Talking Face Generation with Controllable Pose and Eye Blinking Embedding

You Said That?: Synthesising Talking Faces from Audio

Audio-driven Talking Face Video Generation with Learning-based Personalized Head Pose

DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation

Talking Face Generation With Audio-Deduced Emotional Landmarks

JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation

TellMeTalk: Multimodal-driven talking face video generation

Emotionally Controllable Talking Face Generation from an Arbitrary Emotional Portrait

Multimodal Inputs Driven Talking Face Generation With Spatial–Temporal Dependency

VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

TalkCLIP: Talking Head Generation with Text-Guided Expressive Speaking Styles

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

Generating Smooth and Facial-Details-Enhanced Talking Head Video: A Perspective of Pre and Post Processes