TalkLoRA: Low-Rank Adaptation for Speech-Driven Animation

Jack Saunders, Vinay Namboodiri
2024-08-25
Abstract: Speech-driven facial animation is important for many applications including TV, film, video games, telecommunication and AR/VR. Recently, transformers have been shown to be extremely effective for this task. However, we identify two issues with the existing transformer-based models. Firstly, they are difficult to adapt to new personalised speaking styles and secondly, they are slow to run for long sentences due to the quadratic complexity of the transformer. We propose TalkLoRA to address both of these issues. TalkLoRA uses Low-Rank Adaptation to effectively and efficiently adapt to new speaking styles, even with limited data. It does this by training an adaptor with a small number of parameters for each subject. We also utilise a chunking strategy to reduce the complexity of the underlying transformer, allowing for long sentences at inference time. TalkLoRA can be applied to any transformer-based speech-driven animation method. We perform extensive experiments to show that TalkLoRA achieves state-of-the-art style adaptation and that it allows for an order-of-complexity reduction in inference times without sacrificing quality. We also investigate and provide insights into the hyperparameter selection for LoRA fine-tuning of speech-driven facial animation models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to solve two main problems:

1. **Adapting to personalised speaking styles**: Existing Transformer-based speech-driven facial animation models are difficult to adapt to new personalised speaking styles. When faced with a new user or character, they struggle to capture and reproduce that subject's distinctive way of speaking and the facial expressions that go with it.
2. **Slow inference on long sentences**: The time complexity of the Transformer is \(O(N^2)\), where \(N\) is the length of the animation sequence, so existing models become very slow at inference time on long sentences. Specifically, when generating the facial expression at time \(t\), the model attends to all audio information from \(0\) to \(t - 1\). This increases the computational burden, and for facial animation such dependence on the full history is unnecessary.

To solve these problems, the authors propose the **TalkLoRA** method, which makes two key improvements:

- **Low-Rank Adaptation (LoRA)**: By training an adapter with a small number of parameters for each subject, TalkLoRA adapts efficiently to new speaking styles and achieves good results even with limited data. This avoids the overfitting risk of fine-tuning the entire model and allows quick adaptation to new identities.
- **Chunking strategy**: To speed up inference on long sentences, TalkLoRA divides the input audio into fixed-size overlapping chunks that are processed in parallel. This significantly reduces the computational complexity, so the model can handle longer audio sequences while maintaining quality.

With these two improvements, TalkLoRA improves both the adaptability and the inference efficiency of the model, and it can be applied to any Transformer-based speech-driven facial animation model.
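
The two ideas described above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. The function names (`init_lora`, `lora_forward`, `chunk_spans`, and so on), the matrix layout, the `alpha / rank` scaling, and every size in the demo are assumptions made for illustration. The first half shows a generic LoRA update: a frozen weight matrix plus a trainable low-rank product whose "up" factor starts at zero, so the adapted layer initially matches the base model. The second half splits a sequence into fixed-size overlapping chunks and compares a rough attention cost (sum of squared chunk lengths) with the full \(N^2\) cost.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_lora(d_in, d_out, rank, seed=0):
    """Create the two low-rank factors (hypothetical layout: input x output).

    `down` (d_in x rank) gets small random values; `up` (rank x d_out) is zero,
    so the adapter contributes nothing before any training step.
    """
    rng = random.Random(seed)
    down = [[rng.gauss(0.0, 0.01) for _ in range(rank)] for _ in range(d_in)]
    up = [[0.0] * d_out for _ in range(rank)]
    return down, up


def lora_forward(x, w, down, up, alpha):
    """Compute x @ w + (alpha / rank) * (x @ down) @ up with w kept frozen."""
    scale = alpha / len(up)
    base = matmul(x, w)
    delta = matmul(matmul(x, down), up)
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in the adapter, versus d_in * d_out for the full matrix."""
    return rank * (d_in + d_out)


def chunk_spans(n_frames, chunk_size, overlap):
    """Return (start, end) index pairs of fixed-size overlapping chunks."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    spans, start = [], 0
    while start < n_frames:
        end = min(start + chunk_size, n_frames)
        spans.append((start, end))
        if end == n_frames:
            break
        start += chunk_size - overlap  # advance by the stride
    return spans


def attention_cost(spans):
    """Rough self-attention cost: each chunk of length L costs about L * L."""
    return sum((end - start) ** 2 for start, end in spans)


if __name__ == "__main__":
    # All sizes below are hypothetical and chosen only for the demo.
    print("adapter params:", lora_param_count(1024, 1024, rank=4))
    print("full matrix params:", 1024 * 1024)
    spans = chunk_spans(n_frames=1000, chunk_size=100, overlap=20)
    print("chunks:", len(spans))
    print("chunked cost:", attention_cost(spans))
    print("full cost:", attention_cost([(0, 1000)]))
```

With a fixed chunk size the chunked cost grows roughly linearly in the sequence length rather than quadratically, which is the intuition behind the speed-up. How the overlapping regions are merged back into a single animation is not specified in the text above, so it is left out of the sketch.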