Abstract: Speech-driven 3D facial animation aims at generating facial movements that are synchronized with the driving speech, which has been widely explored recently. Existing works mostly neglect the person-specific talking style in generation, including facial expression and head pose styles. Several works intend to capture the personalities by fine-tuning modules. However, limited training data leads to the lack of vividness. In this work, we propose AdaMesh, a novel adaptive speech-driven facial animation approach, which learns the personalized talking style from a reference video of about 10 seconds and generates vivid facial expressions and head poses. Specifically, we propose mixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter, which efficiently captures the facial expression style. For the personalized pose style, we propose a pose adapter by building a discrete pose prior and retrieving the appropriate style embedding with a semantic-aware pose style matrix without fine-tuning. Extensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style in the reference video, and generates vivid facial animation. The supplementary video and code will be available at https://adamesh.github.io.

What problem does this paper attempt to address?

The paper addresses the fact that existing speech-driven 3D facial animation techniques overlook personalized talking styles (facial expressions and head poses) during generation. Specifically:

1. **Lack of personalized talking styles**: Most existing works focus on improving the synchronization between speech and lip movements and ignore person-specific talking styles such as facial expressions and head poses.
2. **Poor performance due to limited training data**: Methods that try to capture personalized styles are prone to catastrophic forgetting and overfitting because the training data is limited, especially for facial expressions. As a result, lip synchronization and expression richness drop significantly on unseen speech inputs.
3. **Averaging problem in head pose generation**: Speech is a weak control signal for head pose, so it cannot adequately drive pose diversity. Predicted head poses therefore lack variation and have small movement amplitudes.

To solve these problems, the authors propose AdaMesh, an adaptive speech-driven 3D facial animation method. AdaMesh addresses the challenges above as follows:

- **Expression Adapter**: The expression adapter is fine-tuned with mixture-of-low-rank adaptation (MoLoRA), which efficiently captures the facial expression style from a small amount of data.
- **Pose Adapter**: A discrete pose prior is built, and a semantic-aware pose style matrix is used to retrieve the appropriate style embedding, so diverse head poses can be generated without fine-tuning any parameters.

With these components, AdaMesh generates facial animation that is synchronized with the speech, retains the personalized talking style of the reference video, and produces more vivid virtual characters.
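
The expression adapter's core idea, adding several low-rank updates of different ranks to a frozen pre-trained weight, can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: it treats the weight as a plain `m x n` matrix with factors shaped as in standard LoRA (`B_i` is `m x r_i`, `A_i` is `r_i x n`), and the helper names, the rank list, and the zero initialization of `B_i` are all hypothetical choices made for the example.

```python
import random


def matmul(left, right):
    """Multiply two matrices stored as lists of rows."""
    inner = len(right)
    assert all(len(row) == inner for row in left), "inner dimensions must match"
    cols = len(right[0])
    return [[sum(row[k] * right[k][j] for k in range(inner)) for j in range(cols)]
            for row in left]


def matadd(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def random_matrix(rows, cols, rng, scale=0.01):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def init_molora(m, n, ranks, rng):
    """Create one (B_i, A_i) pair per rank.

    Assumption: B_i starts at zero, as in standard LoRA, so the summed
    update is zero before any fine-tuning step has been taken.
    """
    return [(zeros(m, r), random_matrix(r, n, rng)) for r in ranks]


def merged_weight(w0, factors):
    """Return W = W_0 + sum_i B_i A_i without modifying W_0."""
    w = [row[:] for row in w0]
    for b, a in factors:
        w = matadd(w, matmul(b, a))
    return w


def trainable_params(factors):
    """Count the entries of all low-rank factors (W_0 stays frozen)."""
    return sum(len(b) * len(b[0]) + len(a) * len(a[0]) for b, a in factors)


if __name__ == "__main__":
    rng = random.Random(0)
    m, n, ranks = 64, 64, [1, 2, 4]  # hypothetical sizes and ranks
    w0 = random_matrix(m, n, rng, scale=1.0)
    factors = init_molora(m, n, ranks, rng)
    print("frozen weight entries:", m * n)
    print("adapter entries:", trainable_params(factors))
```

Only the factors would be updated during adaptation, which is why the approach suits a short reference clip: the number of adapter entries grows with `r_i * (m + n)` rather than `m * n`.
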
Experimental results show that AdaMesh outperforms existing methods on multiple quantitative and qualitative metrics.

### Formula presentation

1. **MoLoRA parameter update formula**:

   \[ W = W_0 + \Delta W = W_0 + \sum_{i=0}^{N} W_i^B W_i^A \]

   where \( W_0 \) is the pre-trained weight matrix, \( W_i^B\in\mathbb{R}^{m/r_i\times k, r_i\times k} \), \( W_i^A\in\mathbb{R}^{r_i\times k, n\times r_i} \), and \( \Delta W \) is the incremental weight matrix given by the low-rank decomposition.

2. **Semantic-aware pose style matrix calculation formula**:

   \[ S_j = \frac{\sum_{i=1}^{T'} \hat{Z}_i \cdot \delta(L_i, j)}{\sum_{i=1}^{T'} \delta(L_i, j)}, \quad j = 1, 2, \ldots, 512 \]

   where \( L_i \) is the cluster label of the \( i \)-th frame and \( \delta \) is the Kronecker delta, which equals 1 when its two arguments are equal and 0 otherwise.

With these methods, AdaMesh generates richer and more diverse facial expressions and head poses while maintaining high-quality lip synchronization, significantly enhancing the expressiveness of virtual characters.
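
Read literally, the pose style matrix formula says that each row \( S_j \) is the mean of the \( \hat{Z}_i \) over all frames whose cluster label equals \( j \). The sketch below computes exactly that average. It rests on a few assumptions that are not in the summary: each \( \hat{Z}_i \) is taken to be a fixed-length list of floats, labels are 0-based rather than 1-based, and a cluster with no frames (where the formula would divide by zero) is returned as `None`. The function name `pose_style_matrix` is hypothetical.

```python
def pose_style_matrix(embeddings, labels, num_clusters):
    """Average the per-frame embeddings that share a cluster label.

    embeddings: list of T' vectors (the Z-hat_i), all of the same length.
    labels: list of T' integer cluster labels in range(num_clusters).
    Returns a list of num_clusters rows; row j is the mean embedding of
    cluster j, or None if no frame carries label j (an assumption made
    here, since the formula is undefined for empty clusters).
    """
    if len(embeddings) != len(labels):
        raise ValueError("need exactly one label per frame")
    dim = len(embeddings[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]
    counts = [0] * num_clusters

    # Accumulate numerator (sum of embeddings) and denominator (frame
    # count) per cluster; this is the Kronecker-delta selection.
    for z, label in zip(embeddings, labels):
        counts[label] += 1
        for d in range(dim):
            sums[label][d] += z[d]

    return [
        [s / counts[j] for s in sums[j]] if counts[j] else None
        for j in range(num_clusters)
    ]


if __name__ == "__main__":
    # Three frames with 2-D embeddings; clusters 0 and 2 are populated.
    frames = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
    frame_labels = [0, 0, 2]
    for j, row in enumerate(pose_style_matrix(frames, frame_labels, 3)):
        print(j, row)
```

With 512 clusters, as in the formula, the result has 512 rows, one candidate style embedding per cluster.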

AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation

Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation

Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

DEITalk: Speech-Driven 3D Facial Animation with Dynamic Emotional Intensity Modeling

DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

EmoFace: Audio-driven Emotional 3D Face Animation

EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

ChatAnything: Facetime Chat with LLM-Enhanced Personas

Breathing Life into Faces: Speech-driven 3D Facial Animation with Natural Head Pose and Detailed Shape

Emotional Speech-Driven Animation with Content-Emotion Disentanglement

Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance

Personalized Speech-driven Expressive 3D Facial Animation Synthesis with Style Control

MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation

EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head

TalkingStyle: Personalized Speech-Driven 3D Facial Animation with Style Preservation

FusionCraft: Fusing Emotion and Identity in Cross-Modal 3D Facial Animation

Pose-Aware 3D Talking Face Synthesis using Geometry-guided Audio-Vertices Attention

3D Talking Face with Personalized Pose Dynamics