Abstract: Different languages have distinct phonetic systems and vary in their prosodic features, making it challenging to develop a Text-to-Speech (TTS) model that can effectively synthesise speech in multilingual settings. Furthermore, a TTS architecture must both capture nuances across multiple languages and remain efficient enough to be practical for deployment. The standard approach is to build a transformer-based model such as SpeechT5 and train it on a large multilingual dataset. As the size of these models grows, conventional fine-tuning for adapting them becomes impractical due to its heavy computational cost. In this paper, we propose to integrate parameter-efficient transfer learning (PETL) methods such as adapters and hypernetworks with a TTS architecture for multilingual speech synthesis. Notably, in our experiments PETL methods achieve comparable or even better performance than full fine-tuning with only $\sim$2.5\% tunable parameters. The code and samples are available at: https://anonymous.4open.science/r/multilingualTTS-BA4C.

What problem does this paper attempt to address?

This paper addresses several key problems in multilingual text-to-speech (TTS) synthesis:

1. **Differences in phonological and prosodic features across languages**: Each language has its own phonological structure and prosodic features, which makes it difficult to develop a TTS model that handles multilingual settings effectively. For example, some languages have more complex tone systems, while others show richer variation in intonation.
2. **Model efficiency and deployment feasibility**: A TTS model must capture the nuances of multiple languages while remaining efficient in practical use. In other words, it has to keep computational cost low without sacrificing the quality of the synthesized speech.
3. **Fine-tuning cost of large-scale models**: As models grow, traditional full-parameter fine-tuning becomes impractical because of its high computational cost. For low-resource languages and under-represented dialects in particular, the lack of sufficient training data means full-parameter fine-tuning may also generalize poorly.

To address these challenges, the paper proposes an approach based on parameter-efficient transfer learning (PETL) that improves multilingual TTS models by integrating adapters and hypernetworks. Specifically, the main contributions of the paper are:

- **Regular adapters and dynamic adapters**: The paper embeds language-specific parameters in a pre-trained model (such as SpeechT5) using regular adapters, and also explores generating these parameters with a hypernetwork, an approach called HyperGenerator.
- **Parameter efficiency**: The PETL methods achieve performance comparable to or even better than full-parameter fine-tuning while tuning only about 2.44% of the parameters.
- **Zero-shot performance improvement**: On unseen languages, HyperGenerator performs better than both full-parameter fine-tuning and regular adapters with the same number of parameters.

Through these methods, the paper aims to improve the efficiency and adaptability of multilingual TTS models, especially in resource-limited situations, while still achieving high-quality speech synthesis.
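To make the distinction between a regular adapter and a hypernetwork-generated adapter concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the class names `RegularAdapter` and `HyperGeneratedAdapter`, the single linear hypernetwork, the toy dimensions, and the three-dimensional language embedding are all hypothetical, and the actual HyperGenerator may condition on different inputs and use a different architecture.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def reshape(flat, rows, cols):
    """Cut a flat list into `rows` rows of length `cols`."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

def adapter_forward(h, W_down, W_up):
    """Residual bottleneck: h + W_up @ relu(W_down @ h)."""
    delta = matvec(W_up, relu(matvec(W_down, h)))
    return [a + b for a, b in zip(h, delta)]

class RegularAdapter:
    """Adapter whose own weights are the tunable parameters."""
    def __init__(self, d_model, d_bottleneck, rng):
        self.W_down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                       for _ in range(d_bottleneck)]
        # Zero-initialised up-projection: the adapter starts as the identity.
        self.W_up = [[0.0] * d_bottleneck for _ in range(d_model)]

    def __call__(self, h):
        return adapter_forward(h, self.W_down, self.W_up)

    def num_params(self):
        return sum(len(r) for r in self.W_down) + sum(len(r) for r in self.W_up)

class HyperGeneratedAdapter:
    """Adapter whose weights are produced by a (here: linear) hypernetwork
    from a conditioning vector, e.g. a language embedding (assumption)."""
    def __init__(self, d_model, d_bottleneck, d_cond, rng):
        self.d_model, self.d_bottleneck = d_model, d_bottleneck
        n_out = 2 * d_model * d_bottleneck  # flattened W_down and W_up
        self.G = [[rng.gauss(0.0, 0.02) for _ in range(d_cond)]
                  for _ in range(n_out)]

    def generate(self, cond):
        flat = matvec(self.G, cond)
        split = self.d_bottleneck * self.d_model
        W_down = reshape(flat[:split], self.d_bottleneck, self.d_model)
        W_up = reshape(flat[split:], self.d_model, self.d_bottleneck)
        return W_down, W_up

    def __call__(self, h, cond):
        W_down, W_up = self.generate(cond)
        return adapter_forward(h, W_down, W_up)

def tunable_fraction(n_tunable, n_frozen):
    """Share of all parameters that are updated during adaptation."""
    return n_tunable / (n_tunable + n_frozen)

# Toy usage with hypothetical sizes (not the paper's configuration).
rng = random.Random(0)
d_model, d_bottleneck, d_cond = 8, 2, 3
h = [rng.uniform(-1, 1) for _ in range(d_model)]   # one hidden state
regular = RegularAdapter(d_model, d_bottleneck, rng)
hyper = HyperGeneratedAdapter(d_model, d_bottleneck, d_cond, rng)
lang_embedding = [0.5, -0.2, 0.1]                  # hypothetical language vector
out_regular = regular(h)
out_hyper = hyper(h, lang_embedding)
```

In both variants the pre-trained backbone stays frozen and only the small added module is trained. The difference is that the regular adapter stores one fixed set of weights, whereas the hypernetwork variant produces different adapter weights for each conditioning vector, which is one plausible reason such an approach could transfer better to unseen languages.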

Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation
