Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation

Yingting Li, Ambuj Mehrish, Bryan Chew, Bo Cheng, Soujanya Poria
2024-06-25
Abstract: Different languages have distinct phonetic systems and vary in their prosodic features, making it challenging to develop a Text-to-Speech (TTS) model that can effectively synthesise speech in multilingual settings. Furthermore, a TTS architecture needs to be both efficient enough to capture nuances in multiple languages and efficient enough to be practical for deployment. The standard approach is to build a transformer-based model such as SpeechT5 and train it on a large multilingual dataset. As these models grow in size, conventional fine-tuning for adapting them becomes impractical due to its heavy computational cost. In this paper, we propose to integrate parameter-efficient transfer learning (PETL) methods such as adapters and hypernetworks with the TTS architecture for multilingual speech synthesis. Notably, in our experiments PETL methods achieve comparable or even better performance than full fine-tuning with only ~2.5% tunable parameters. The code and samples are available at: https://anonymous.4open.science/r/multilingualTTS-BA4C.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses three key problems in multilingual text-to-speech (TTS) synthesis:

1. **Differences in phonological and prosodic features across languages**: Each language has its own phonological structure and prosodic features, which makes it challenging to develop a TTS model that handles multilingual settings effectively. For example, some languages have more complex tone systems, while others show richer variation in intonation.
2. **Model efficiency and deployment feasibility**: A TTS model must capture the nuances of multiple languages while remaining efficient in practical use. In other words, it must keep its consumption of computational resources low while maintaining high-quality speech synthesis.
3. **Fine-tuning cost of large-scale models**: As model size increases, conventional full-parameter fine-tuning becomes impractical because of its high computational cost. For low-resource languages and under-represented dialects in particular, the lack of sufficient training data means that full-parameter fine-tuning may generalize poorly.

To address these challenges, the paper proposes an approach based on parameter-efficient transfer learning (PETL) that improves multilingual TTS models by integrating adapters and hypernetworks. Specifically, the main contributions are:

- **Regular and dynamic adapters**: Language-specific parameters are embedded in a pre-trained model (such as SpeechT5) using regular adapters, and the paper also explores generating these parameters with a hypernetwork, an approach it calls HyperGenerator.
- **Parameter efficiency**: The approach achieves performance comparable to, or even better than, full-parameter fine-tuning while tuning only about 2.44% of the parameters.
- **Zero-shot performance improvement**: On unseen languages, HyperGenerator outperforms both full-parameter fine-tuning and regular adapters with the same number of parameters.

Through these methods, the paper aims to improve the efficiency and adaptability of multilingual TTS models, especially in resource-limited settings, while still achieving high-quality speech synthesis.
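
The adapter and hypernetwork ideas summarised above can be illustrated with a small, dependency-free sketch. It is not the paper's implementation: the names `adapter_forward` and `HyperAdapterGenerator`, the single linear generator, the ReLU bottleneck, and the choice to condition on a concatenated language and layer embedding are all assumptions made for illustration. The sketch shows a residual bottleneck adapter, and a generator that produces the adapter's weights from a conditioning vector instead of storing them per layer.

```python
import random


def linear(x, weight, bias):
    """Return weight @ x + bias for one input vector (plain Python lists)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def adapter_forward(h, w_down, b_down, w_up, b_up):
    """Residual bottleneck adapter: h + up(relu(down(h))).

    The frozen backbone's hidden state h passes through unchanged, and the
    adapter only adds a low-dimensional correction on top of it.
    """
    z = [max(0.0, v) for v in linear(h, w_down, b_down)]  # down-project + ReLU
    delta = linear(z, w_up, b_up)                          # up-project
    return [hi + di for hi, di in zip(h, delta)]


class HyperAdapterGenerator:
    """Hypothetical hypernetwork: maps a conditioning vector to adapter weights.

    One generator is shared by all layers; the conditioning vector (assumed
    here to be a language embedding concatenated with a layer embedding)
    selects which adapter weights are produced.
    """

    def __init__(self, cond_dim, hidden_dim, bottleneck_dim, rng, scale=0.02):
        self.hidden_dim = hidden_dim
        self.bottleneck_dim = bottleneck_dim
        n_out = 2 * hidden_dim * bottleneck_dim  # entries of w_down plus w_up
        self.weight = [[rng.gauss(0.0, scale) for _ in range(cond_dim)]
                       for _ in range(n_out)]
        self.bias = [0.0] * n_out

    def generate(self, cond):
        """Return (w_down, w_up) reshaped from the generator's flat output."""
        flat = linear(cond, self.weight, self.bias)
        d, r = self.hidden_dim, self.bottleneck_dim
        w_down = [flat[i * d:(i + 1) * d] for i in range(r)]      # shape r x d
        offset = r * d
        w_up = [flat[offset + i * r:offset + (i + 1) * r]
                for i in range(d)]                                # shape d x r
        return w_down, w_up


if __name__ == "__main__":
    rng = random.Random(0)
    hidden_dim, bottleneck_dim = 8, 2

    # Illustrative embeddings only; real ones would be learned.
    language_embedding = [0.5, -0.1, 0.3]
    layer_embedding = [1.0, 0.0]
    cond = language_embedding + layer_embedding

    generator = HyperAdapterGenerator(len(cond), hidden_dim, bottleneck_dim, rng)
    w_down, w_up = generator.generate(cond)

    hidden_state = [rng.gauss(0.0, 1.0) for _ in range(hidden_dim)]
    adapted = adapter_forward(hidden_state, w_down, [0.0] * bottleneck_dim,
                              w_up, [0.0] * hidden_dim)
    print(len(adapted))
```

In this sketch only the generator's weights would be trained, and because the same generator serves every layer, its size does not grow with the depth of the backbone. A regular adapter would instead store a separate `w_down` and `w_up` for each layer.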