EmoTalker: Emotionally Editable Talking Face Generation via Diffusion Model

Bingyuan Zhang, Xulong Zhang, Ning Cheng, Jun Yu, Jing Xiao, Jianzong Wang
2024-01-16
Abstract: In recent years, the field of talking face generation has attracted considerable attention, with certain methods adept at generating virtual faces that convincingly imitate human expressions. However, existing methods generalize poorly, particularly when dealing with challenging identities. Furthermore, methods for editing expressions are often confined to a single emotion and fail to adapt to intricate emotions. To overcome these challenges, this paper proposes EmoTalker, an emotionally editable portrait animation approach based on the diffusion model. EmoTalker modifies the denoising process to preserve the original portrait's identity during inference. To enhance emotion comprehension from text input, an Emotion Intensity Block is introduced to analyze fine-grained emotions and their strengths derived from prompts. Additionally, a purpose-built dataset is used to enhance emotion comprehension within prompts. Experiments show the effectiveness of EmoTalker in generating high-quality, emotionally customizable facial expressions.
Computer Vision and Pattern Recognition, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses two main problems:

1. **Limited generalization ability**: Existing talking-face generation methods generalize poorly to challenging identities and often fail to generate high-quality facial expressions for them.
2. **Limitations of single-emotion editing**: Current methods are usually limited to a single emotion when editing expressions and cannot adapt to complex emotions. This narrows the range of emotional expression and prevents diverse emotional states from being conveyed accurately.

To overcome these challenges, the paper proposes **EmoTalker**, an emotion-editable talking-face generation framework based on the diffusion model. The main contributions of EmoTalker include:

- **Conditional diffusion model**: A conditional diffusion model guides the denoising process with the complex emotions and intensities contained in the text prompt to generate the desired expressions.
- **Denoising process that preserves the original portrait identity**: The denoising mechanism used at inference is modified so that the generated frames stay consistent with the identity of the original portrait, thereby improving generalization.
- **Emotion Intensity Block and new dataset**: The Emotion Intensity Block and a new dataset, FED, are introduced to enhance the model's understanding of complex emotions and their intensities.

With these changes, the paper reports that EmoTalker generates high-quality, emotion-customizable facial expressions.
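The summary does not say how the denoising process is modified. As a rough illustration only, the sketch below shows one generic way a diffusion sampler can keep part of an input fixed: after every reverse step, the region that must stay unchanged is overwritten with a suitably noised copy of the original portrait, as in diffusion-based inpainting. This is an assumption and not EmoTalker's actual method. The names `toy_eps_model`, `sample_with_identity`, `edit_mask`, and `emotion_strength` are hypothetical, and the "portrait" is a flat list of floats standing in for pixels.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus cumulative products of (1 - beta). Needs num_steps >= 2."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        alpha_bars.append(running)
    return betas, alpha_bars


def toy_eps_model(x_t, t, emotion_strength):
    """Hypothetical stand-in for a trained, prompt-conditioned noise predictor."""
    return [0.1 * v - 0.01 * emotion_strength for v in x_t]


def sample_with_identity(portrait, edit_mask, emotion_strength,
                         num_steps=50, seed=0):
    """DDPM-style reverse loop that only edits positions where edit_mask is 1."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in portrait]  # start from pure noise
    for t in range(num_steps - 1, -1, -1):
        eps = toy_eps_model(x, t, emotion_strength)
        scale = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - scale * ei) / math.sqrt(1.0 - betas[t])
                for xi, ei in zip(x, eps)]
        if t > 0:
            # Standard reverse step: posterior mean plus fresh noise.
            x = [m + math.sqrt(betas[t]) * rng.gauss(0.0, 1.0) for m in mean]
            # Original portrait noised to the level of step t - 1.
            a = alpha_bars[t - 1]
            known = [math.sqrt(a) * p + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
                     for p in portrait]
        else:
            # Final step: no noise is added, and the clean portrait is used.
            x, known = mean, list(portrait)
        # Assumed identity-preservation step: keep generated values only
        # inside the editable region, and restore the portrait elsewhere.
        x = [xi if m else ki for xi, ki, m in zip(x, known, edit_mask)]
    return x
```

Calling `sample_with_identity([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1], emotion_strength=0.7)` returns a list whose first two entries equal the original portrait values, while the last two come from the (toy) conditioned sampler. In a real system the noise predictor would be a trained network conditioned on the text prompt, and the mask would select the facial region being edited.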