JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Language Model

Farzaneh Jafari, Stefano Berretti, Anup Basu
2024-08-03
Abstract: In recent years, talking head generation has become a focal point for researchers. Considerable effort is being made to refine lip-sync motion, capture expressive facial expressions, generate natural head poses, and achieve high video quality. However, no single model has yet achieved equivalence across all these metrics. This paper aims to animate a 3D face using Jamba, a hybrid Transformer-Mamba model. Mamba, a pioneering Structured State Space Model (SSM) architecture, was designed to address the constraints of the conventional Transformer architecture. Nevertheless, it has several drawbacks. Jamba merges the advantages of both Transformer and Mamba approaches, providing a holistic solution. Based on the foundational Jamba block, we present JambaTalk to enhance motion variety and speed through multimodal integration. Extensive experiments reveal that our method achieves performance comparable or superior to state-of-the-art models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses several challenges in 3D talking-head generation at once: high-fidelity lip-sync, natural head poses, rich facial expressions, and high-quality video output. Many studies improve one or another of these aspects, but no single model has yet been optimal on all of these metrics. The paper therefore proposes a hybrid model, JambaTalk, which combines the advantages of the Transformer and Mamba models with the aim of achieving better results in 3D talking-head generation.

### Main Contributions

1. **Introduction of the JambaTalk Framework**: A framework for speech-driven 3D talking-head generation that combines multiple Mamba, MoE-Mamba, and Transformer layers. Adjusting the order of the Mamba and MoE-Mamba layers improves the results further.
2. **Utilization of Rotary Position Embedding (RoPE) and Grouped Query Attention (GQA)**: These techniques are used to improve the Transformer layer, especially on long sequences.
3. **Extensive Experimental Verification**: Experiments on the VOCASET dataset show that the proposed model is comparable to or better than existing state-of-the-art models.

### Method Overview

The goal of JambaTalk is to generate continuous 3D facial animation from the raw audio input and the previous facial motion sequence. The model includes the following main parts:

1. **Audio Encoder**: A pre-trained Wav2Vec 2.0 model extracts audio features, which a multi-layer Transformer encoder turns into contextualized speech representations.
2. **JambaTalk Decoder**: Based on the Jamba model, it combines the advantages of the Transformer and Mamba architectures. Introducing the Mixture of Experts (MoE) mechanism in specific layers increases model capacity while keeping the number of active parameters manageable.
3. **Selective State-Space Layers**: Three Mamba layers are applied on both sides of the Transformer layer. Mamba is a structured state-space sequence model that improves prediction by dynamically selecting the relevant parts of the input.
4. **Mixture of Experts (MoE) Layers**: Routing each input to the top-scoring experts improves the expressiveness and efficiency of the model.
5. **Rotary Position Embedding (RoPE)**: Encodes absolute position with a rotation matrix and, as a result, builds relative position dependencies directly into the self-attention mechanism.
6. **Grouped Query Attention (GQA)**: Uses an intermediate number of key-value heads to reach quality comparable to Multi-Head Attention (MHA) at a speed close to that of Multi-Query Attention (MQA).

### Experimental Results

- **Quantitative Evaluation**: On the VOCASET dataset, JambaTalk outperforms other existing methods in terms of lip vertex error (LVE) and upper-face dynamics deviation (FDD).
- **Qualitative Analysis**: In visual evaluation, the 3D talking heads generated by JambaTalk perform well in lip-sync and overall facial dynamics and can capture rich facial expressions and natural head movements.

### Conclusion

JambaTalk addresses multiple challenges in 3D talking-head generation by combining the advantages of the Transformer and Mamba models and adding several optimization techniques. The experimental results indicate that the model matches or improves on existing methods in generating high-quality, natural 3D talking heads.
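
To make the RoPE item in the method overview concrete, here is a minimal sketch in plain Python. It is not the JambaTalk implementation: the function names, the toy vectors, and the frequency base of 10000 (the common default) are assumptions chosen for illustration. It shows the property the summary describes: each vector is rotated by its absolute position, yet the attention score between a query and a key depends only on their relative offset.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out = []
    for i in range(dim // 2):
        # Pair i rotates at its own frequency; lower pairs rotate faster.
        theta = position * base ** (-2.0 * i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors (illustrative values only).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3 positions), very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
```

The two scores agree up to floating-point error, and the rotation leaves vector norms unchanged, which is why RoPE can be applied to queries and keys without rescaling the attention logits.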
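
The GQA item in the method overview can be sketched in the same way. The code below is an illustrative toy in plain Python, not the paper's code: the function names, the head counts, and the input values are all assumptions. Several query heads share one key-value head, so with one key-value head per query head this reduces to MHA, and with a single key-value head it reduces to MQA.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def grouped_query_attention(queries, keys, values):
    """queries: [n_q_heads][head_dim]; keys, values: [n_kv_heads][seq_len][head_dim]."""
    n_q, n_kv = len(queries), len(keys)
    if n_q % n_kv != 0:
        raise ValueError("query heads must divide evenly into key-value heads")
    group = n_q // n_kv  # number of query heads sharing one key-value head
    return [attend(q, keys[h // group], values[h // group])
            for h, q in enumerate(queries)]

# Toy setup: 4 query heads share 2 key-value heads (groups of 2), sequence length 2.
queries = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
keys = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
values = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]]]

out = grouped_query_attention(queries, keys, values)
```

Query heads 0 and 1 read the same keys and values, so identical queries give identical outputs, while head 2 carries the same query but belongs to the second group and therefore produces a different result. The saving in practice comes from storing and loading only `n_kv_heads` key-value tensors instead of one per query head.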
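
For the LVE metric named in the quantitative evaluation, the sketch below follows the definition commonly used in the speech-driven 3D face animation literature: take the maximum L2 error over the lip vertices in each frame, then average over frames. The summary does not spell out the formula, so this definition is an assumption here, as are the function name, the lip indices, and the toy meshes. Some public implementations use the squared distance instead of the Euclidean distance.

```python
import math

def lip_vertex_error(pred, target, lip_indices):
    """Mean over frames of the largest lip-vertex distance in each frame.

    pred, target: [n_frames][n_vertices][3] vertex coordinates.
    lip_indices: indices of the vertices that belong to the lip region.
    """
    per_frame_max = []
    for pred_frame, target_frame in zip(pred, target):
        worst = max(math.dist(pred_frame[v], target_frame[v]) for v in lip_indices)
        per_frame_max.append(worst)
    return sum(per_frame_max) / len(per_frame_max)

# Toy data: 2 frames, 3 vertices; vertices 0 and 2 are treated as lip vertices.
target = [[[0.0, 0.0, 0.0]] * 3, [[0.0, 0.0, 0.0]] * 3]
pred = [
    [[3.0, 4.0, 0.0], [100.0, 0.0, 0.0], [0.0, 0.0, 1.0]],  # lip errors: 5 and 1
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]],    # lip errors: 0 and 1
]

lve = lip_vertex_error(pred, target, lip_indices=[0, 2])  # (5 + 1) / 2 = 3.0
```

The large error on vertex 1 does not affect the result because that vertex is outside the lip region, which is the point of the metric: it isolates lip-sync accuracy from errors elsewhere on the face.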