Abstract: In recent years, talking head generation has become a focal point for researchers. Considerable effort is being made to refine lip-sync motion, capture expressive facial expressions, generate natural head poses, and achieve high video quality. However, no single model has yet performed equally well across all these metrics. This paper aims to animate a 3D face using Jamba, a hybrid Transformer-Mamba model. Mamba, a pioneering Structured State Space Model (SSM) architecture, was designed to address the constraints of the conventional Transformer architecture. Nevertheless, it has several drawbacks. Jamba merges the advantages of both Transformer and Mamba approaches, providing a holistic solution. Based on the foundational Jamba block, we present JambaTalk to enhance motion variety and speed through multimodal integration. Extensive experiments reveal that our method achieves performance comparable or superior to state-of-the-art models.

What problem does this paper attempt to address?

This paper addresses several challenges in 3D talking-head generation, in particular how to achieve high-fidelity lip-sync, natural head poses, rich facial expressions, and high-quality video output. Although many studies have improved individual aspects, no single model has been optimal on all of these metrics. The paper therefore proposes a new hybrid model, JambaTalk, which combines the advantages of the Transformer and Mamba models in the hope of achieving better results in 3D talking-head generation.

### Main Contributions

1. **Introduction of the JambaTalk Framework**: A framework for speech-driven 3D talking-head generation. It combines multiple Mamba, MoE-Mamba, and Transformer layers to improve generation performance. Adjusting the order of the Mamba and MoE-Mamba layers improves the results further.
2. **Utilization of Rotary Position Embedding (RoPE) and Grouped Query Attention (GQA)**: These techniques are used to enhance the performance of the Transformer layer, especially on long sequences.
3. **Extensive Experimental Verification**: Experiments on the VOCASET dataset show that the proposed model is comparable to or better than existing state-of-the-art models.

### Method Overview

The goal of JambaTalk is to generate continuous 3D facial animation from the raw audio input and the previous facial motion sequence. The model includes the following main parts:

1. **Audio Encoder**: Uses the pre-trained Wav2Vec 2.0 model to extract audio features. These features are transformed into contextualized speech representations by a multi-layer Transformer encoder.
2. **JambaTalk Decoder**: Based on the Jamba model, it combines the advantages of the Transformer and Mamba architectures. Introducing the Mixture of Experts (MoE) mechanism in specific layers improves performance while keeping the number of active parameters manageable.
3. **Selective State-Space Layers**: Three Mamba layers are applied on both sides of the Transformer layer. Mamba is a structured state-space sequence model that improves prediction performance by dynamically selecting key input segments.
4. **Mixture of Experts (MoE) Layers**: Routing each input to the top-ranked experts improves the expressiveness and efficiency of the model.
5. **Rotary Position Embedding (RoPE)**: Encodes absolute position information through a rotation matrix and directly integrates relative position dependencies into the self-attention mechanism.
6. **Grouped Query Attention (GQA)**: Uses an intermediate number of key-value heads to achieve quality comparable to Multi-Head Attention (MHA) while maintaining a speed similar to Multi-Query Attention (MQA).

### Experimental Results

- **Quantitative Evaluation**: Results on the VOCASET dataset show that JambaTalk outperforms other existing methods in terms of lip vertex error (LVE) and upper-face dynamics deviation (FDD).
- **Qualitative Analysis**: In visual evaluation, the 3D talking heads generated by JambaTalk perform well in lip-sync and overall facial dynamics, and capture rich facial expressions and natural head movements.

### Conclusion

JambaTalk addresses multiple challenges in 3D talking-head generation by combining the advantages of the Transformer and Mamba models and introducing several optimization techniques. The experimental results show that the model has significant advantages in generating high-quality, natural 3D talking heads.
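The RoPE item in the method overview says that rotating by absolute position yields relative-position dependence inside attention. The sketch below illustrates that property in plain Python. It is not the paper's implementation: the function names `rope` and `dot`, the base of 10000, and the example vectors are assumptions chosen for illustration.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of `vec` by an angle that grows with `position`.

    Illustrative sketch only; `base` follows the commonly used default
    and is an assumption, not a value taken from the paper.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE operates on pairs, so the dimension must be even"
    out = []
    for i in range(0, d, 2):
        # Pair j = i / 2 uses frequency base^(-2j/d) = base^(-i/d).
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for one attention logit."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query/key vectors.
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.3, -0.7, 1.0]

# Both pairs of positions are 4 steps apart, so the scores should agree.
near = dot(rope(q, 7), rope(k, 3))
far = dot(rope(q, 104), rope(k, 100))
print(abs(near - far) < 1e-9)
```

Because each pair is rotated by an angle proportional to its position, the dot product of a rotated query and key depends only on the offset between the two positions, which is why the two scores match.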

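The summary reports lip vertex error (LVE) without defining it. A minimal sketch follows, assuming the definition commonly used in prior speech-driven 3D face work: for each frame, take the largest L2 error over the lip vertices, then average over frames. Some implementations use squared distances instead. The function name, the data layout, and the toy values are hypothetical.

```python
import math


def lip_vertex_error(pred, target, lip_ids):
    """Mean over frames of the maximum per-frame L2 error on lip vertices.

    pred, target: list of frames; each frame is a list of (x, y, z) vertices.
    lip_ids: indices of the vertices that belong to the lip region (assumed given).
    """
    per_frame_max = []
    for pred_frame, target_frame in zip(pred, target):
        worst = max(math.dist(pred_frame[v], target_frame[v]) for v in lip_ids)
        per_frame_max.append(worst)
    return sum(per_frame_max) / len(per_frame_max)


# Toy data: 2 frames, 3 vertices; vertex 1 is not a lip vertex and is ignored.
target = [[(0, 0, 0)] * 3, [(0, 0, 0)] * 3]
pred = [
    [(3, 4, 0), (100, 0, 0), (0, 0, 1)],  # lip errors 5 and 1, max 5
    [(0, 0, 0), (0, 0, 0), (0, 0, 2)],    # lip errors 0 and 2, max 2
]
lve = lip_vertex_error(pred, target, lip_ids=[0, 2])
print(lve)  # 3.5
```

Taking the per-frame maximum rather than the mean makes the metric sensitive to a single badly placed lip vertex, which is the kind of error that is most visible in lip-sync.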
JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Language Model
