Liyang Chen, Weihong Bao, Shun Lei, Boshi Tang, Zhiyong Wu, Shiyin Kang, Haozhi Huang, Helen Meng
Abstract: Speech-driven 3D facial animation aims at generating facial movements that are synchronized with the driving speech, which has been widely explored recently. Existing works mostly neglect the person-specific talking style in generation, including facial expression and head pose styles. Several works intend to capture the personalities by fine-tuning modules. However, limited training data leads to the lack of vividness. In this work, we propose AdaMesh, a novel adaptive speech-driven facial animation approach, which learns the personalized talking style from a reference video of about 10 seconds and generates vivid facial expressions and head poses. Specifically, we propose mixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter, which efficiently captures the facial expression style. For the personalized pose style, we propose a pose adapter by building a discrete pose prior and retrieving the appropriate style embedding with a semantic-aware pose style matrix without fine-tuning. Extensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style in the reference video, and generates vivid facial animation. The supplementary video and code will be available at https://adamesh.github.io.
What problem does this paper attempt to address?
The paper addresses a gap in existing speech-driven 3D facial animation methods: they overlook the person-specific talking style, which covers both facial expressions and head poses, during generation. Specifically:
1. **Lack of personalized talking styles**: Most existing work focuses on improving the synchronization between speech and lip movements and ignores person-specific talking styles such as facial expressions and head poses.
2. **Poor performance due to limited training data**: Methods that try to capture personalized styles by fine-tuning are prone to catastrophic forgetting and overfitting because the adaptation data is limited, especially for facial expressions. This significantly degrades lip synchronization and expression richness on unseen speech inputs.
3. **Averaging problem in head pose generation**: Speech is a weak control signal for head motion, so it cannot account for the diversity of head poses. Predicted head poses therefore lack variation and have small movement amplitudes.
To solve these problems, the authors propose AdaMesh, an adaptive speech-driven 3D facial animation method. AdaMesh addresses the above challenges in the following ways:
- **Expression Adapter**: Mixture-of-low-rank adaptation (MoLoRA) is used to fine-tune the expression adapter, which efficiently captures the facial expression style from a small amount of data.
- **Pose Adapter**: A discrete pose prior is built, and a semantic-aware pose style matrix is used to retrieve the appropriate style embedding, so diverse head poses can be generated without fine-tuning any parameters.
With these two adapters, AdaMesh generates facial animation that is synchronized with the speech, preserves the personalized talking style of the reference video, and produces more vivid virtual characters. Experimental results show that AdaMesh outperforms existing methods on multiple quantitative and qualitative metrics.
### Key formulas
1. **MoLoRA parameter update formula**:
\[
W = W_0+\Delta W = W_0+\sum_{i = 0}^{N}W_i^BW_i^A
\]
where \( W_0 \) is the pre-trained weight matrix, \( W_i^B\in\mathbb{R}^{m/r_i\times k, r_i\times k} \), \( W_i^A\in\mathbb{R}^{r_i\times k, n\times r_i} \), and \( \Delta W \) is the incremental weight matrix, expressed as a sum of low-rank products.
2. **Semantic-aware pose style matrix formula**:
\[
S_j=\frac{\sum_{i = 1}^{T'}\hat{Z}_i\cdot\delta(L_i, j)}{\sum_{i = 1}^{T'}\delta(L_i, j)}, \quad j \in \{1, 2,\ldots, 512\}
\]
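where \( L_i \) is the cluster label of the \( i \)-th frame, the sums run over \( T' \) frames, and \( \delta \) is the Kronecker delta, which equals 1 when its two arguments are equal and 0 otherwise. In other words, \( S_j \) is the mean of \( \hat{Z}_i \) over all frames whose label is \( j \).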
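The pose style matrix formula is a per-cluster average, which the following minimal sketch illustrates. It assumes each \( \hat{Z}_i \) is a per-frame pose latent stored as a plain list of floats; the function name `pose_style_matrix` and the choice to return `None` for clusters with no frames (where the denominator is zero and the formula is undefined) are illustrative assumptions, not details from the paper.

```python
def pose_style_matrix(latents, labels, num_clusters=512):
    """Compute S_j = mean of Z_hat_i over frames i with L_i == j.

    latents: list of T' per-frame vectors (assumed shape; hypothetical input).
    labels:  list of T' cluster labels in 1..num_clusters, as in the formula.
    Returns a list of num_clusters rows; empty clusters map to None (assumption).
    """
    if len(latents) != len(labels):
        raise ValueError("need exactly one label per frame")
    if not latents:
        return [None] * num_clusters

    dim = len(latents[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]  # numerator per cluster
    counts = [0] * num_clusters                        # denominator per cluster

    for z, label in zip(latents, labels):
        j = label - 1  # the formula indexes clusters from 1
        counts[j] += 1
        for d, value in enumerate(z):
            sums[j][d] += value

    return [
        [s / counts[j] for s in sums[j]] if counts[j] else None
        for j in range(num_clusters)
    ]


# Toy example: three frames, four clusters, two-dimensional latents.
toy_latents = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
toy_labels = [1, 1, 3]
style = pose_style_matrix(toy_latents, toy_labels, num_clusters=4)
print(style[0])  # row for cluster 1: average of the first two frames
```

In the toy example, frames 1 and 2 share label 1, so the first row is their element-wise mean, while clusters 2 and 4 receive no frames and stay empty.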
With these two components, AdaMesh generates richer and more diverse facial expressions and head poses while maintaining high-quality lip synchronization, which makes the resulting virtual characters more expressive.
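The MoLoRA update \( W = W_0 + \sum_i W_i^B W_i^A \) can be illustrated with a small dependency-free sketch. It makes simplifying assumptions that are not taken from the paper: a plain linear layer with \( W_0 \) of size \( m \times n \), factors of size \( m \times r_i \) and \( r_i \times n \) (the kernel-size factor \( k \) in the dimensions above is ignored), and the usual LoRA-style initialization in which \( W_i^B \) starts at zero. The helper names `matmul`, `molora_weight`, and `init_adapter` are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    return [[sum(row[t] * b[t][c] for t in range(inner))
             for c in range(len(b[0]))]
            for row in a]


def molora_weight(w0, adapters):
    """Return W = W_0 + sum_i (W_i^B @ W_i^A) for a list of (W_B, W_A) pairs."""
    w = [list(row) for row in w0]  # copy, so the pre-trained weight is untouched
    for w_b, w_a in adapters:
        delta = matmul(w_b, w_a)   # this term has rank at most r_i
        for r, delta_row in enumerate(delta):
            for c, value in enumerate(delta_row):
                w[r][c] += value
    return w


def init_adapter(m, n, rank, rng):
    """Assumed LoRA-style init: W_B all zeros, W_A small Gaussian noise."""
    w_b = [[0.0] * rank for _ in range(m)]
    w_a = [[rng.gauss(0.0, 0.01) for _ in range(n)] for _ in range(rank)]
    return w_b, w_a


# Hand-checkable example: one rank-1 adapter and one rank-2 adapter.
w0 = [[1.0, 0.0], [0.0, 1.0]]
adapters = [
    ([[1.0], [2.0]], [[3.0, 4.0]]),                        # rank 1
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),  # rank 2
]
merged = molora_weight(w0, adapters)
print(merged)
```

Mixing adapters of different ranks lets each term contribute an update of different capacity. With the zero initialization assumed here, the merged weight equals \( W_0 \) before any fine-tuning step, so adaptation starts from the pre-trained behavior.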