Abstract: Audio-driven emotional 3D face animation aims to generate emotionally expressive talking heads with synchronized lip movements. However, previous research has often overlooked the influence of diverse emotions on facial expressions or proved unsuitable for driving MetaHuman models. To address this gap, we introduce EmoFace, a novel audio-driven methodology for creating facial animations with vivid emotional dynamics. Our approach can generate facial expressions with multiple emotions, and can produce random yet natural blinks and eye movements while maintaining accurate lip synchronization. We propose independent speech and emotion encoders to learn the relationship between audio, emotion, and the corresponding facial controller rigs, and finally map these to a sequence of controller values. Additionally, we introduce two post-processing techniques dedicated to enhancing the authenticity of the animation, particularly in blinks and eye movements. Furthermore, recognizing the scarcity of emotional audio-visual data suitable for MetaHuman model manipulation, we contribute an emotional audio-visual dataset and derive control parameters for each frame. Our proposed methodology can be applied to producing dialogue animations of non-playable characters (NPCs) in video games and to driving avatars in virtual reality environments. Quantitative and qualitative experiments, together with a user study comparing against existing work, show that our approach achieves superior results in driving 3D facial models. The code and sample data are available at https://github.com/SJTU-Lucy/EmoFace.

"Mood Avatar: Automatic Text-Driven Head Motion Synthesis" International Conference on Multimodal Interfaces (ICMI2010)

Text-driven Visual Prosody Generation for Embodied Conversational Agents

Head Movement Synthesis Based on Semantic and Prosodic Features for a Chinese Expressive Avatar

Emotional Head Motion Predicting from Prosodic and Linguistic Features

Emotional Chinese talking head system

Text to Avatar in Multi-modal Human Computer Interface

Head and Facial Gestures Synthesis Using PAD Model for an Expressive Talking Avatar

Animating a Chinese interactive virtual character

InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

Multimodal emotion estimation and emotional synthesize for interaction virtual agent

Real-time Speech-Driven Animation of Expressive Talking Faces

Video-driven state-aware facial animation

Real-time Synthesis of Chinese Visual Speech and Facial Expressions Using MPEG-4 FAP Features in a Three-Dimensional Avatar

Real-time Conversion from a Single 2D Face Image to a 3D Text-Driven Emotive Audio-Visual Avatar

A Multimodal Approach of Generating 3D Human-Like Talking Agent

EmoFace: Audio-driven Emotional 3D Face Animation

T3M: Text Guided 3D Human Motion Synthesis from Speech

Text/Speech-Driven Full-Body Animation
