Abstract: Several works have developed end-to-end pipelines for generating lip-synced talking faces with various real-world applications, such as teaching and language translation in videos. However, these prior works fail to create realistic-looking videos since they focus little on people's expressions and emotions. Moreover, these methods' effectiveness largely depends on the faces in the training dataset, which means they may not perform well on unseen faces. To mitigate this, we build a talking face generation framework conditioned on a categorical emotion to generate videos with appropriate expressions, making them more realistic and convincing. With a broad range of six emotions, i.e., happiness, sadness, fear, anger, disgust, and neutral, we show that our model can adapt to arbitrary identities, emotions, and languages. Our proposed framework is equipped with a user-friendly web interface with a real-time experience for talking face generation with emotions. We also conduct a user study for subjective evaluation of our interface's usability, design, and functionality. Project page: https://midas.iiitd.edu.in/emo/

What problem does this paper attempt to address?

This paper addresses the lack of realism in existing methods for generating talking-face videos. These methods pay little attention to human expressions and emotions, so the generated videos do not look realistic. Their effectiveness also depends heavily on the faces in the training dataset, so they may not handle unseen faces well. To mitigate these problems, the authors build a talking-face generation framework conditioned on a categorical emotion, which generates videos with appropriate expressions and makes them more realistic and convincing. Using six emotion categories (happiness, sadness, fear, anger, disgust, and neutral), the authors show that the model can adapt to arbitrary identities, emotions, and languages. They also develop a user-friendly web interface that provides a real-time talking-face generation experience, and they conduct a user study to evaluate the interface's usability, design, and functionality.

The main contributions of the paper are:

1. A new deep-learning model that generates high-fidelity, lip-synchronized talking-face videos with different emotions and corresponding expressions.
2. A multimodal framework for generating lip-synchronized videos independent of any identity, language, and emotion.
3. A responsive web interface that supports real-time emotional talking-face generation.

The paper addresses the shortcomings of existing methods in emotional expression by introducing an emotion encoder and an emotion discriminator, which enhance the emotional expressiveness of the generated videos and thereby improve their realism and practicality.
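To make "conditioned on a categorical emotion" concrete, the sketch below encodes one of the six emotion labels as a one-hot vector and appends it to a feature vector. This is only an illustration under stated assumptions: the names `EMOTIONS`, `one_hot_emotion`, and `condition_features` are hypothetical, the label ordering is arbitrary, and simple concatenation is a stand-in. The paper's emotion encoder is a learned network whose details are not given in this summary.

```python
# Hypothetical illustration of categorical emotion conditioning.
# None of these names come from the paper; the ordering of labels is arbitrary.
EMOTIONS = ["happiness", "sadness", "fear", "anger", "disgust", "neutral"]


def one_hot_emotion(label: str) -> list[float]:
    """Return a six-dimensional one-hot vector for a categorical emotion label."""
    if label not in EMOTIONS:
        raise ValueError(f"unknown emotion: {label!r}")
    return [1.0 if name == label else 0.0 for name in EMOTIONS]


def condition_features(features: list[float], label: str) -> list[float]:
    """Append the emotion code to a feature vector (plain concatenation).

    Assumption: a real model would pass the emotion code through a learned
    encoder before fusing it with audio and visual features.
    """
    return features + one_hot_emotion(label)


if __name__ == "__main__":
    print(one_hot_emotion("fear"))
    print(condition_features([0.5, 0.25], "neutral"))
```

Because the emotion is a discrete label rather than something inferred from the audio, the same speech and identity inputs could in principle be paired with any of the six categories at generation time.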

Emotionally Enhanced Talking Face Generation

Emotionally Controllable Talking Face Generation from an Arbitrary Emotional Portrait

Speech Driven Talking Face Generation from a Single Image and an Emotion Condition

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model

EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

EAT-Face: Emotion-Controllable Audio-Driven Talking Face Generation Via Diffusion Model

Emotion-Controllable Generalized Talking Face Generation

Talking Face Generation With Audio-Deduced Emotional Landmarks

High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning

Audio-driven Talking Face Video Generation with Natural Head Pose

EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model

GMTalker: Gaussian Mixture Based Emotional Talking Video Portraits

DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation

Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation

3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head

EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

Talking Faces: Audio-to-Video Face Generation

EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation

EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion