Abstract: Audio-driven emotional 3D face animation aims to generate emotionally expressive talking heads with synchronized lip movements. However, previous research has often overlooked the influence of diverse emotions on facial expressions or proved unsuitable for driving MetaHuman models. To address this gap, we introduce EmoFace, a novel audio-driven methodology for creating facial animations with vivid emotional dynamics. Our approach can generate facial expressions with multiple emotions, as well as random yet natural blinks and eye movements, while maintaining accurate lip synchronization. We propose independent speech encoders and emotion encoders to learn the relationship between audio, emotion and the corresponding facial controller rigs, and finally map them to a sequence of controller values. Additionally, we introduce two post-processing techniques dedicated to enhancing the authenticity of the animation, particularly in blinks and eye movements. Furthermore, recognizing the scarcity of emotional audio-visual data suitable for MetaHuman model manipulation, we contribute an emotional audio-visual dataset and derive control parameters for each frame. Our proposed methodology can be applied to producing dialogue animations for non-playable characters (NPCs) in video games and to driving avatars in virtual reality environments. Quantitative and qualitative experiments, as well as a user study comparing against existing work, show that our approach achieves superior results in driving 3D facial models. The code and sample data are available at <a class="link-external link-https" href="https://github.com/SJTU-Lucy/EmoFace" rel="external noopener nofollow">this https URL</a>.
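
The data flow described in the abstract (separate speech and emotion encoders whose outputs are mapped to per-frame controller values) can be illustrated with a toy sketch. This is not the EmoFace network: the linear "encoders", the random weights, the emotion label set, all dimensions, and the [0, 1] controller range are assumptions chosen only to show how an emotion embedding and per-frame audio features might be fused.

```python
import random

# Everything below is hypothetical and for illustration only.
EMOTIONS = ["neutral", "happy", "sad", "angry"]  # assumed label set
AUDIO_DIM = 8        # assumed per-frame audio feature size
EMOTION_DIM = 4      # assumed emotion embedding size
NUM_CONTROLLERS = 6  # assumed; a real rig exposes many more controllers


def make_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyEmoModel:
    def __init__(self, seed=0):
        rng = random.Random(seed)
        # Emotion "encoder": one embedding vector per emotion label.
        self.emotion_table = {
            e: [rng.uniform(-1.0, 1.0) for _ in range(EMOTION_DIM)]
            for e in EMOTIONS
        }
        # Speech "encoder": a linear projection applied to each audio frame.
        self.speech_proj = make_matrix(AUDIO_DIM, AUDIO_DIM, rng)
        # Decoder: maps the fused features to controller values.
        self.decoder = make_matrix(NUM_CONTROLLERS, AUDIO_DIM + EMOTION_DIM, rng)

    def predict(self, audio_frames, emotion):
        """Return one list of controller values per audio frame."""
        emotion_vec = self.emotion_table[emotion]
        sequence = []
        for frame in audio_frames:
            speech_vec = matvec(self.speech_proj, frame)
            fused = speech_vec + emotion_vec  # concatenate the two encodings
            raw = matvec(self.decoder, fused)
            # Clamp to [0, 1]; the valid controller range is an assumption.
            sequence.append([min(1.0, max(0.0, 0.5 + v)) for v in raw])
        return sequence
```

Calling `ToyEmoModel().predict(frames, "happy")` with a list of `AUDIO_DIM`-length feature vectors returns one controller vector per frame; passing a different emotion label changes the output for the same audio, which is the behavior the abstract attributes to the emotion encoder.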

What problem does this paper attempt to address?

The paper addresses how to generate 3D facial animations that are not only synchronized with the audio but also express rich emotions and include natural blinks and eye movements, thereby improving the realism and interactive experience of virtual characters. Specifically, the main challenges mentioned in the paper include:

1. **Lack of emotional expression**: Although existing audio-driven facial animation techniques achieve good lip synchronization, they usually lack emotional expression. Even when the input audio clips carry emotion, the generated facial animations are often neutral.
2. **Model applicability issues**: The images produced by existing generation methods are not suitable for directly driving virtual character models, especially advanced 3D models such as MetaHuman.
3. **Limitations of datasets**: Most emotion-related audio-visual datasets are currently recorded in English, while Chinese differs significantly in pronunciation and emotional expression. Models trained on these datasets may not accurately generate facial animations for Chinese audio. In addition, existing datasets are hard to use directly for model training because they lack a mapping to 3D model controller values.

To solve these problems, the paper proposes EmoFace, a new method for driving virtual characters based on audio and emotional inputs. The main contributions of this method include:

- Constructing a Chinese audio-visual dataset containing multiple emotions and extracting the controller values of each frame.
- Proposing a basic model that can generate MetaHuman controller parameters with multiple emotions, thereby achieving high-quality facial animations.
- Introducing independent blink and eye-movement control modules, which enhance the naturalness and authenticity of facial expressions.

Through these innovations, EmoFace not only achieves lip synchronization when generating 3D facial animations, but also expresses rich emotions and generates natural blinks and eye movements, thereby significantly improving the realism of virtual characters and the user experience.
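
As a rough illustration of the blink control idea, the sketch below overlays randomly timed blinks on a per-frame eyelid-closure track as a post-processing step. It is a minimal sketch under stated assumptions, not the paper's procedure: the function names, the uniform gap distribution, the 2 to 6 second gap range, the 6-frame blink length, and the half-sine blink shape are all hypothetical.

```python
import math
import random


def blink_curve(num_frames):
    """Eyelid closure for one blink: rises toward 1, then falls back toward 0."""
    return [math.sin(math.pi * (i + 1) / (num_frames + 1)) for i in range(num_frames)]


def add_blinks(eyelid_track, fps=30, min_gap_s=2.0, max_gap_s=6.0,
               blink_frames=6, seed=None):
    """Return (new_track, blink_start_frames) without modifying the input.

    eyelid_track holds one eyelid-closure value per frame (0 = open, 1 = closed).
    All timing parameters are assumed values, not taken from the paper.
    """
    rng = random.Random(seed)
    track = list(eyelid_track)
    curve = blink_curve(blink_frames)
    starts = []

    # Draw the first blink time, then keep drawing random gaps between blinks.
    t = int(rng.uniform(min_gap_s, max_gap_s) * fps)
    while t + blink_frames <= len(track):
        for i, closure in enumerate(curve):
            # Keep whichever is more closed: the predicted value or the blink.
            track[t + i] = max(track[t + i], closure)
        starts.append(t)
        t += blink_frames + int(rng.uniform(min_gap_s, max_gap_s) * fps)
    return track, starts
```

For a 20-second clip at 30 fps, `add_blinks([0.0] * 600, seed=0)` returns a track of the same length with several blinks spaced at least two seconds apart. Taking the maximum with the predicted value means an emotion-driven squint from the model is preserved rather than overwritten.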

EmoFace: Audio-driven Emotional 3D Face Animation

EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

Emotional Speech-Driven Animation with Content-Emotion Disentanglement

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation

3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head

ECAvatar: 3D Avatar Facial Animation with Controllable Identity and Emotion

EAT-Face: Emotion-Controllable Audio-Driven Talking Face Generation Via Diffusion Model

EmoVOCA: Speech-Driven Emotional 3D Talking Heads

EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions

EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion

Audio-Driven Emotional Video Portraits

Emotional Voice Puppetry

Voicing Your Emotion: Integrating Emotion and Identity in Cross-Modal 3D Facial Animations

ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model

DEITalk: Speech-Driven 3D Facial Animation with Dynamic Emotional Intensity Modeling

Audio-Driven Emotional 3D Talking-Head Generation

Real-time Conversion from a Single 2D Face Image to a 3D Text-Driven Emotive Audio-Visual Avatar

Expressive 3D Facial Animation Generation Based on Local-to-Global Latent Diffusion