ESARM: 3D Emotional Speech-to-Animation via Reward Model from Automatically-Ranked Demonstrations

Xulong Zhang, Xiaoyang Qu, Haoxiang Shi, Chunguang Xiao, Jianzong Wang
2024-11-20
Abstract: This paper proposes a novel 3D speech-to-animation (STA) generation framework designed to address the shortcomings of existing models in producing diverse and emotionally resonant animations. Current STA models often generate animations that lack emotional depth and variety, failing to align with human expectations. To overcome these limitations, we introduce a novel STA model coupled with a reward model. This combination enables the decoupling of emotion and content under audio conditions through a cross-coupling training approach. Additionally, we develop a training methodology that leverages automatic quality evaluation of generated facial animations to guide the reinforcement learning process. This methodology encourages the STA model to explore a broader range of possibilities, resulting in the generation of diverse and emotionally expressive facial animations of superior quality. We conduct extensive empirical experiments on a benchmark dataset, and the results validate the effectiveness of our proposed framework in generating high-quality, emotionally rich 3D animations that are better aligned with human preferences.
Computer Vision and Pattern Recognition, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the shortcomings of existing 3D speech-to-animation (STA) models in generating diverse, emotionally rich facial animations. Animations produced by current STA models often lack emotional depth and variety and fall short of human expectations. To address this, the paper proposes a framework named ESARM (3D Emotional Speech-to-Animation via Reward Model from Automatically-Ranked Demonstrations).

### Main problems:

1. **Lack of emotional depth and diversity**: Animations generated by existing STA models usually lack emotional expression, which makes them less vivid and less realistic.
2. **Lip-sync and overall facial expressions are not coordinated**: Existing methods focus mainly on lip-sync and ignore the subtle dynamics of the rest of the face, which can produce an "uncanny valley" effect in which the animation looks unrealistic.
3. **Difficulties in emotional decoupling**: Separating the explicit content of speech from its implicit emotional information is a major challenge, especially in 2D facial animation. Manually encoded emotions easily lead to inconsistencies between the input speech and the facial expressions.

### Solutions:

To overcome these limitations, the ESARM framework introduces the following innovations:

1. **Emotional decoupling**: Through a cross-coupling training method, ESARM decouples the emotion and content in speech, which allows it to generate more diverse and emotionally rich facial animations.
2. **Reinforcement learning based on reward models**: ESARM trains a reward model on automatically ranked demonstration data and uses it to guide the reinforcement learning process, so that the STA model explores a wider range of possibilities under audio conditions and generates high-quality, emotionally rich 3D facial animations.
3. **Automated quality assessment**: The training method guides reinforcement learning by automatically evaluating the quality of the generated facial animations, so that the animations are both technically accurate and consistent with human aesthetic standards for emotional expression.

### Formula representation:

The description of the ESARM framework involves a few key formulas. In the reinforcement learning stage, the goal is to maximize the expected return:

\[ \phi^{*} = \arg \max_{\phi} E_{\tau \sim \pi_{\phi}}[R(\tau)] = \arg \max_{\phi} \sum_{\tau} p_{\tau \sim \pi_{\phi}}(\tau) R(\tau) \]

where \( p_{\tau \sim \pi_{\phi}}(\tau) \) is the probability of generating a motion sequence \( \tau \) under the policy \( \pi_{\phi} \), and \( R(\tau) = \sum_{t = 0}^{T - 1} \gamma^t R(s_t, a_t) \) is the discounted return.

The reward model is trained with a pairwise ranking loss to distinguish high-quality from low-quality facial animations:

\[ L_{RM} = -E_{(\tau_i, \tau_j, y) \in D} \left[(1 - y) \log R(\tau_i \prec \tau_j; \theta) + y \log R(\tau_j \prec \tau_i; \theta)\right] \]

where \( y = \text{int}(\epsilon_i < \epsilon_j) \) is the binary ranking label and \( R(\tau_i \prec \tau_j; \theta) \) is the ranking predictor with parameters \( \theta \).

Through these innovations, ESARM aims to generate high-quality, emotionally rich 3D facial animations that are more in line with human expectations.
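As a small numeric illustration of the reinforcement learning objective, the sketch below computes the discounted return \( R(\tau) \) from per-step rewards and the expected return as a probability-weighted sum over a finite set of trajectories. The function names and the toy numbers are hypothetical and chosen only for illustration. A real STA policy would not enumerate trajectories like this.

```python
def discounted_return(rewards, gamma):
    """R(tau) = sum_{t=0}^{T-1} gamma^t * R(s_t, a_t) for one trajectory."""
    total = 0.0
    # Accumulate from the last step backwards (Horner's scheme).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def expected_return(traj_probs, traj_returns):
    """sum_tau p(tau) * R(tau) over an explicitly enumerated trajectory set."""
    if len(traj_probs) != len(traj_returns):
        raise ValueError("need one probability per trajectory")
    return sum(p * r for p, r in zip(traj_probs, traj_returns))


# Toy example: two trajectories with hypothetical per-step rewards.
returns = [
    discounted_return([1.0, 1.0, 1.0], gamma=0.5),  # 1 + 0.5 + 0.25
    discounted_return([1.0], gamma=0.5),
]
value = expected_return([0.25, 0.75], returns)
```

Maximizing over \( \phi \) then amounts to shifting probability mass toward trajectories with higher \( R(\tau) \).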
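The pairwise ranking loss can be sketched in a few lines under one loud assumption: the form of the ranking predictor is not given here, so the code assumes a Bradley-Terry style predictor, \( R(\tau_i \prec \tau_j; \theta) = \sigma(r_j - r_i) \), where \( r_i \) and \( r_j \) are scalar scores that a reward network would assign to the two animations. The meaning of \( \epsilon \) is also not stated, so the code uses it only through the comparison \( \epsilon_i < \epsilon_j \), exactly as the formula does. All names are hypothetical.

```python
import math


def rank_prob(score_lower, score_higher):
    """ASSUMED predictor: probability that the first trajectory ranks
    below the second, modeled as sigmoid(score_higher - score_lower)."""
    return 1.0 / (1.0 + math.exp(-(score_higher - score_lower)))


def ranking_loss(pairs):
    """Mean pairwise ranking loss L_RM over (score_i, score_j, eps_i, eps_j)."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    total = 0.0
    for s_i, s_j, eps_i, eps_j in pairs:
        y = int(eps_i < eps_j)             # label from the formula
        p_i_below_j = rank_prob(s_i, s_j)  # R(tau_i < tau_j; theta)
        p_j_below_i = rank_prob(s_j, s_i)  # R(tau_j < tau_i; theta)
        total += (1 - y) * math.log(p_i_below_j) + y * math.log(p_j_below_i)
    return -total / len(pairs)


# With y = 1, the loss is small when tau_i scores higher than tau_j.
agree = ranking_loss([(2.0, 0.0, 0.1, 0.2)])
disagree = ranking_loss([(0.0, 2.0, 0.1, 0.2)])
```

With equal scores the predictor outputs 0.5 for either ordering and the loss equals \( \log 2 \), which is the value an untrained reward model would start near under this assumed predictor.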