Abstract: Deep learning techniques are now widely applied in automated text-to-speech (TTS) systems and have significantly improved performance. However, these methods require large amounts of paired text-speech data for model training, and collecting this data is costly. In this paper, we therefore propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target language text-speech paired data for training. We evaluate three approaches for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the previous two methods. In the cross-lingual transfer learning method, we used two high-resource language datasets, English (24 h) and Japanese (10 h). The 30 min of target language data was used for training in all three approaches, and for generating the augmented data used for training in methods 2 and 3. We found that using both cross-lingual transfer learning and augmented data during training resulted in the most natural synthesized target speech output. We also compare single-speaker and multi-speaker training methods, using sequential and simultaneous training, respectively. The multi-speaker models were found to be more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders, one using 13 h of our augmented data with 30 min of target language data and one using the entire 12 h of the original target language dataset. Our subjective AB preference test indicated that the neural vocoder trained with augmented data achieved almost the same perceived speech quality as the vocoder trained with the entire target language dataset.
Overall, we found that our proposed TTS system consisting of a spectrogram prediction network and a PWG neural vocoder was able to achieve reasonable performance using only 30 min of target language training data. We also found that by using 3 h of target language data, for training the model and for generating augmented data, our proposed TTS model was able to achieve performance very similar to that of the baseline model, which was trained with 12 h of target language data.
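As an illustration of how an AB preference result like the vocoder comparison above can be checked, the following minimal sketch runs a two-sided exact binomial test against the null hypothesis of no preference. The abstract does not state which statistical procedure was used, so this test is an assumption, and the function name `ab_preference_p_value` and any vote counts passed to it are hypothetical.

```python
from math import comb


def ab_preference_p_value(prefer_a: int, prefer_b: int) -> float:
    """Two-sided exact binomial test for an AB preference test.

    Null hypothesis: listeners have no preference (each vote is a fair coin).
    The counts are hypothetical inputs; "no preference" answers are assumed
    to have been excluded before calling this function.
    """
    n = prefer_a + prefer_b
    if n == 0:
        return 1.0
    k = max(prefer_a, prefer_b)
    # Probability of a split at least as lopsided as the observed one,
    # on one side of the symmetric Binomial(n, 0.5) distribution.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double for the two-sided test and cap at 1.0 (the cap applies when
    # the votes are split evenly and the two tails overlap).
    return min(1.0, 2 * tail)
```

A large p-value means the listening test gives no evidence that the two systems differ in preference. That is consistent with "almost the same perceived speech quality", although it does not by itself prove equivalence.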

Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis

Rapid Speaker Adaptation in Low Resource Text to Speech Systems using Synthetic Data and Transfer learning

Cross-lingual Low Resource Speaker Adaptation Using Phonological Features

Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation

Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition

Towards High-Quality Neural TTS for Low-Resource Languages by Learning Compact Speech Representations

A multilingual training strategy for low resource Text to Speech

Transfer Learning for Low-Resource, Multi-Lingual, and Zero-Shot Multi-Speaker Text-to-Speech

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

Extending Multilingual Speech Synthesis to 100+ Languages without Transcribed Data

An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios

End-to-end Text-to-speech for Low-resource Languages by Cross-Lingual Transfer Learning

Transfer Learning, Style Control, and Speaker Reconstruction Loss for Zero-Shot Multilingual Multi-Speaker Text-to-Speech on Low-Resource Languages

Adapting TTS models For New Speakers using Transfer Learning

Code-Mixed Text to Speech Synthesis under Low-Resource Constraints

Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation

Text-to-Speech for Low-Resource Agglutinative Language With Morphology-Aware Language Model Pre-Training

ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

Almost Unsupervised Text to Speech and Automatic Speech Recognition

Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation

Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus