Abstract: In recent years, talking face generation has attracted considerable attention, and some methods can generate virtual faces that convincingly imitate human expressions. However, existing methods generalize poorly, particularly when dealing with challenging identities. Furthermore, methods for editing expressions are often confined to a single emotion and fail to adapt to intricate emotions. To overcome these challenges, this paper proposes EmoTalker, an emotionally editable portrait animation approach based on the diffusion model. EmoTalker modifies the denoising process to preserve the original portrait's identity during inference. To enhance emotion comprehension from text input, an Emotion Intensity Block is introduced to analyze the fine-grained emotions and their strengths derived from prompts. Additionally, a crafted dataset is used to enhance emotion comprehension within prompts. Experiments show the effectiveness of EmoTalker in generating high-quality, emotionally customizable facial expressions.

What problem does this paper attempt to address?

The paper addresses two main problems:

1. **Limited generalization ability**: Existing talking-face generation methods generalize poorly to challenging identities and often fail to generate high-quality facial expressions for them.
2. **Limitations of single-emotion editing**: Current expression-editing methods are usually limited to a single emotion and cannot adapt to complex emotions. This narrows the range of emotional expression and prevents them from accurately conveying diverse emotional states.

To overcome these challenges, the paper proposes **EmoTalker**, an emotion-editable talking-face generation framework based on the diffusion model. Its main contributions are:

- **Conditional diffusion model**: A conditional diffusion model whose denoising process is guided by the complex emotions and intensities contained in the text prompt, so that it generates the desired expressions.
- **Denoising process that preserves the original portrait identity**: The denoising mechanism used at inference is modified so that the generated frames stay consistent with the identity of the original portrait, which improves generalization.
- **Emotion Intensity Block and new dataset**: The Emotion Intensity Block and a new dataset, FED, are introduced to enhance the model's understanding of complex emotions and their intensities.

According to the paper's experiments, these changes allow EmoTalker to generate high-quality, emotion-customizable facial expressions.
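
One way to picture an identity-preserving, emotion-conditioned denoising loop is the toy sketch below. It is not EmoTalker's actual mechanism, which this summary does not specify: it swaps in mask-based blending as used in diffusion inpainting, where pixels marked as "keep" are reset at every step to a suitably noised copy of the source portrait. The function names (`make_schedule`, `toy_denoiser`, `sample`), the `keep_mask` argument, and the emotion-weight vector are all hypothetical, and `toy_denoiser` is a stand-in for a trained noise-prediction network.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (the values here are assumptions)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_denoiser(x_t, t, emotion_cond):
    """Hypothetical stand-in for a trained, emotion-conditioned network.

    A real model would predict the noise in x_t from x_t, the step t and an
    embedding of the prompt. This toy only nudges x_t toward a value derived
    from the conditioning vector so that the loop is runnable.
    """
    return x_t - emotion_cond.mean()


def sample(portrait, keep_mask, emotion_cond, num_steps=50, seed=0):
    """Reverse diffusion with inpainting-style blending.

    keep_mask is 1 where the source portrait must be preserved and 0 where
    the model is free to generate new content.
    """
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal(portrait.shape)
    for t in range(num_steps - 1, -1, -1):
        eps = toy_denoiser(x, t, emotion_cond)
        # Standard DDPM posterior mean for the generated content.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x_gen = mean + np.sqrt(betas[t]) * noise
        # Preserved content: the portrait noised to level t-1 (clean at the end).
        if t > 0:
            ab_prev = alpha_bars[t - 1]
            x_known = (np.sqrt(ab_prev) * portrait
                       + np.sqrt(1.0 - ab_prev) * rng.standard_normal(x.shape))
        else:
            x_known = portrait
        x = keep_mask * x_known + (1.0 - keep_mask) * x_gen
    return x


# Toy usage: an 8x8 "portrait" whose top half must be preserved, with a
# hypothetical compound emotion given as weights (e.g. 0.7 happy, 0.3 surprised).
portrait = np.linspace(-1.0, 1.0, 64).reshape(8, 8)
keep_mask = np.zeros((8, 8))
keep_mask[:4, :] = 1.0
emotion_cond = np.array([0.7, 0.3])
frame = sample(portrait, keep_mask, emotion_cond)
```

Resetting the kept region at every step guarantees that it matches the source portrait in the final frame, while the rest of the image is driven by the conditioned denoiser. A real system would need a learned network and a meaningful choice of which content carries identity.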

EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion

Emotionally Controllable Talking Face Generation from an Arbitrary Emotional Portrait

EAT-Face: Emotion-Controllable Audio-Driven Talking Face Generation Via Diffusion Model

DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation

DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models

EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation

Continuously Controllable Facial Expression Editing in Talking Face Videos

Emotionally Enhanced Talking Face Generation

High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning

3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator

EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model

DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models

Talking Face Generation With Audio-Deduced Emotional Landmarks

EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation

EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

Speech Driven Talking Face Generation from a Single Image and an Emotion Condition