EmoFace: Audio-driven Emotional 3D Face Animation

Chang Liu, Qunfen Lin, Zijiao Zeng, Ye Pan
DOI: https://doi.org/10.1109/VR58804.2024.00060
2024-07-17
Abstract: Audio-driven emotional 3D face animation aims to generate emotionally expressive talking heads with synchronized lip movements. However, previous research has often overlooked the influence of diverse emotions on facial expressions or proved unsuitable for driving MetaHuman models. To address this gap, we introduce EmoFace, a novel audio-driven methodology for creating facial animations with vivid emotional dynamics. Our approach can generate facial expressions with multiple emotions and can produce random yet natural blinks and eye movements while maintaining accurate lip synchronization. We propose independent speech encoders and emotion encoders to learn the relationship between audio, emotion, and the corresponding facial controller rigs, and finally map these into a sequence of controller values. Additionally, we introduce two post-processing techniques dedicated to enhancing the authenticity of the animation, particularly in blinks and eye movements. Furthermore, recognizing the scarcity of emotional audio-visual data suitable for MetaHuman model manipulation, we contribute an emotional audio-visual dataset and derive control parameters for each frame. Our proposed methodology can be applied to producing dialogue animations for non-playable characters (NPCs) in video games and to driving avatars in virtual reality environments. Quantitative and qualitative experiments, together with a user study comparing against existing work, show that our approach achieves superior results in driving 3D facial models. The code and sample data are available at https://github.com/SJTU-Lucy/EmoFace.
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
This paper addresses how to generate 3D facial animations that are not only synchronized with the audio but also emotionally expressive, with natural blinks and eye movements, thereby improving the realism and interactive experience of virtual characters. Specifically, the main challenges mentioned in the paper include:

1. **Lack of emotional expression**: Existing audio-driven facial animation techniques achieve good lip synchronization but usually lack emotional expression. Even when the input audio clip is emotional, the generated facial animation is often neutral.
2. **Model applicability issues**: The images produced by existing generation methods are not suitable for directly driving virtual character models, especially advanced 3D models such as MetaHuman.
3. **Limitations of datasets**: Most emotion-related audio-visual datasets are currently recorded in English, while Chinese differs significantly in pronunciation and emotional expression. Models trained on these datasets may not accurately generate facial animations for Chinese audio. In addition, existing datasets are difficult to use directly for model training because they lack a mapping to 3D model controller values.

To solve these problems, the paper proposes EmoFace, a new method for driving virtual characters from audio and emotion inputs. The main contributions of this method include:

- Constructing a Chinese audio-visual dataset containing multiple emotions and extracting the controller values for each frame.
- Proposing a base model that generates MetaHuman controller parameters with multiple emotions, thereby achieving high-quality facial animations.
- Introducing independent blink and eye-movement control modules, which enhance the naturalness and authenticity of facial expressions.

Through these innovations, EmoFace achieves lip synchronization in the generated 3D facial animations while also expressing rich emotions and producing natural blinks and eye movements, thereby improving the realism of virtual characters and the user experience.
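To make the idea of a blink post-processing step more concrete, here is a minimal Python sketch of one way random blinks could be overlaid on a per-frame eyelid controller track. It is not taken from the paper or its repository: the function name `add_random_blinks`, the exponential spacing between blinks, the triangular closure profile, and all default parameter values are assumptions chosen purely for illustration.

```python
import random


def add_random_blinks(eyelid, fps=30, mean_interval_s=4.0, blink_frames=6, seed=None):
    """Overlay blinks on a per-frame eyelid-closure track (0.0 = open, 1.0 = closed).

    Hypothetical helper: the spacing model and blink shape are illustrative
    assumptions, not the post-processing described in the paper.
    """
    rng = random.Random(seed)
    out = list(eyelid)
    half = blink_frames / 2.0

    # Draw the first blink onset from an exponential distribution (in frames).
    t = int(rng.expovariate(1.0 / mean_interval_s) * fps)
    while t + blink_frames <= len(out):
        for i in range(blink_frames):
            # Triangular profile: closure peaks near the middle of the blink.
            closure = 1.0 - abs((i + 0.5) - half) / half
            # Never open the eye more than the model already predicted.
            out[t + i] = max(out[t + i], closure)
        # Wait a random gap (at least one frame) before the next blink.
        gap = int(rng.expovariate(1.0 / mean_interval_s) * fps)
        t += blink_frames + max(gap, 1)
    return out


if __name__ == "__main__":
    # Ten seconds of a fully open eye at 30 fps, with a fixed seed for repeatability.
    track = add_random_blinks([0.0] * 300, fps=30, seed=0)
    print(sum(1 for v in track if v > 0.5), "frames with the eye mostly closed")
```

Taking the per-frame maximum keeps any eyelid closure the network already predicted, so the overlay only adds blinks and never removes existing motion. A real pipeline would presumably write the result back to the corresponding MetaHuman eyelid controllers, which this sketch does not model.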