Abstract: This paper proposes a novel 3D speech-to-animation (STA) generation framework designed to address the shortcomings of existing models in producing diverse and emotionally resonant animations. Current STA models often generate animations that lack emotional depth and variety, failing to align with human expectations. To overcome these limitations, we introduce a novel STA model coupled with a reward model. This combination enables the decoupling of emotion and content under audio conditions through a cross-coupling training approach. Additionally, we develop a training methodology that leverages automatic quality evaluation of generated facial animations to guide the reinforcement learning process. This methodology encourages the STA model to explore a broader range of possibilities, resulting in the generation of diverse and emotionally expressive facial animations of superior quality. We conduct extensive empirical experiments on a benchmark dataset, and the results validate the effectiveness of our proposed framework in generating high-quality, emotionally rich 3D animations that are better aligned with human preferences.

What problem does this paper attempt to address?

This paper addresses the shortcomings of existing 3D speech-to-animation (STA) models in generating diverse and emotionally rich facial animations. Specifically, the animations produced by current STA models often lack emotional depth and diversity and fall short of human expectations. To solve these problems, the paper proposes a new framework named ESARM (3D Emotional Speech-to-Animation via Reward Model from Automatically-Ranked Demonstrations).

### Main problems:

1. **Lack of emotional depth and diversity**: Animations generated by existing STA models usually lack emotional expression, which makes them less vivid and realistic.
2. **Lip-sync and overall facial expressions are not coordinated**: Existing methods focus mainly on lip-sync and ignore the subtle dynamics of other facial expressions. This can lead to the "uncanny valley" effect, in which the animation looks almost, but not convincingly, realistic.
3. **Difficulties in emotional decoupling**: Separating the explicit content of speech from its implicit emotional information is a major challenge, especially in 2D facial animations. Manually encoding emotions easily leads to inconsistencies between the input speech and the facial expressions.

### Solutions:

To overcome these limitations, the ESARM framework introduces the following innovations:

1. **Emotional decoupling**: Through a cross-coupling training method, ESARM decouples the emotion and content in speech, which lets it generate more diverse and emotionally rich facial animations.
2. **Reinforcement learning based on reward models**: ESARM trains a reward model on automatically ranked demonstration data and uses it to guide the reinforcement learning process. This enables the STA model to explore a wider range of possibilities under audio conditions and to generate high-quality, emotionally rich 3D facial animations.
3. **Automated quality assessment**: A training method guides the reinforcement learning process by automatically evaluating the quality of the generated facial animations, so that the animations are not only technically accurate but also meet human aesthetic standards for emotional expression.

### Formula representation:

The description of the ESARM framework involves several key formulas. In the reinforcement learning stage, the goal is to maximize the expected return:

\[ \phi = \arg \max_{\phi} \mathbb{E}_{\tau \sim \pi_{\phi}}[R(\tau)] = \arg \max_{\phi} \sum_{\tau} p_{\tau \sim \pi_{\phi}}(\tau) \, R(\tau) \]

where \( p_{\tau \sim \pi_{\phi}}(\tau) \) is the probability of generating a motion sequence \( \tau \) under the policy \( \pi_{\phi} \), and \( R(\tau) = \sum_{t = 0}^{T - 1} \gamma^t R(s_t, a_t) \) is the discounted return.

In addition, the reward model is trained with a pairwise ranking loss to distinguish between high-quality and low-quality facial animations:

\[ L_{RM} = -\mathbb{E}_{(\tau_i, \tau_j, y) \in D} \left[ (1 - y) \log R(\tau_i \prec \tau_j; \theta) + y \log R(\tau_j \prec \tau_i; \theta) \right] \]

where \( y = \text{int}(\epsilon_i < \epsilon_j) \), and \( R(\tau_i \prec \tau_j; \theta) \) is the ranking predictor.

Through these innovations, ESARM aims to generate high-quality, emotionally rich 3D facial animations that are more in line with human expectations.
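The two formulas above can be made concrete with a small Python sketch. It is illustrative only and not the paper's implementation. Several parts are assumptions that the summary does not state: each trajectory is reduced to a single scalar score from a hypothetical reward model, the ranking predictor is modeled as a logistic (Bradley-Terry style) function of the score difference, "a ≺ b" is read as "a is ranked below b", and the expectation over the dataset is approximated by a plain average. The function names are hypothetical.

```python
import math
from typing import Iterable, Sequence, Tuple


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """R(tau) = sum_{t=0}^{T-1} gamma^t * r_t for one trajectory."""
    total = 0.0
    # Work backwards through time: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def log_sigmoid(x: float) -> float:
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def log_rank_prob(score_a: float, score_b: float) -> float:
    """Log-probability that trajectory a is ranked below trajectory b.

    Assumption: a logistic model over scalar scores stands in for the
    ranking predictor R(tau_a < tau_b; theta) in the text.
    """
    return log_sigmoid(score_b - score_a)


def pairwise_ranking_loss(pairs: Iterable[Tuple[float, float, int]]) -> float:
    """Average pairwise ranking loss over (score_i, score_j, y) triples.

    y = 1 marks tau_i as the better trajectory, y = 0 marks tau_j,
    mirroring the two terms of L_RM.
    """
    total = 0.0
    count = 0
    for score_i, score_j, y in pairs:
        total += (1 - y) * log_rank_prob(score_i, score_j)
        total += y * log_rank_prob(score_j, score_i)
        count += 1
    if count == 0:
        raise ValueError("pairwise_ranking_loss needs at least one pair")
    return -total / count
```

For example, `discounted_return([1.0, 1.0, 1.0], 0.5)` evaluates to `1.75`. The ranking loss is small when the scores agree with the label and large when they contradict it, which is the pressure that pushes a reward model to score higher-ranked animations above lower-ranked ones.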

ESARM: 3D Emotional Speech-to-Animation via Reward Model from Automatically-Ranked Demonstrations

CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation

Emotional Speech-Driven Animation with Content-Emotion Disentanglement

ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE

3D-TalkEmo: Learning to Synthesize 3D Emotional Talking Head

EmotionGesture: Audio-Driven Diverse Emotional Co-Speech 3D Gesture Generation

Expressive Speech-driven Facial Animation with controllable emotions

EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

2D/3D Expression Generation Using Advanced Learning Techniques and the Emotion Wheel

ECAvatar: 3D Avatar Facial Animation with Controllable Identity and Emotion

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

Low-Rank Active Learning for Generating Speech-Drive Human Face Animation

Towards Rich Emotions in 3D Avatars: A Text-to-3D Avatar Generation Benchmark

EmoFace: Audio-driven Emotional 3D Face Animation

End-to-end Learning for 3D Facial Animation from Raw Waveforms of Speech

Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation

Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion

AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation

DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation

EMO: Emote Portrait Alive -- Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions