Abstract: Deep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resulting in significant improvements in performance. However, these methods require large amounts of text-speech paired data for model training, and collecting this data is costly. Therefore, in this paper, we propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target language text-speech paired data for training. We evaluate three approaches for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the previous two methods. In the cross-lingual transfer learning method, we used two high-resource language datasets, English (24 h) and Japanese (10 h). We also used 30 min of target language data for training in all three approaches, and for generating the augmented data used for training in methods 2 and 3. We found that using both cross-lingual transfer learning and augmented data during training resulted in the most natural synthesized target speech output. We also compare single-speaker and multi-speaker training methods, using sequential and simultaneous training, respectively. The multi-speaker models were found to be more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders, one using 13 h of our augmented data with 30 min of target language data and one using the entire 12 h of the original target language dataset. Our subjective AB preference test indicated that the neural vocoder trained with augmented data achieved almost the same perceived speech quality as the vocoder trained with the entire target language dataset.
Overall, we found that our proposed TTS system consisting of a spectrogram prediction network and a PWG neural vocoder was able to achieve reasonable performance using only 30 min of target language training data. We also found that by using 3 h of target language data, for training the model and for generating augmented data, our proposed TTS model was able to achieve performance very similar to that of the baseline model, which was trained with 12 h of target language data.
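The data used by the three training approaches can be laid out as a small bookkeeping sketch. The following is only an illustration under stated assumptions, not the authors' implementation: the names `Corpus`, `training_plan`, and `total_hours` are hypothetical, the two-stage ordering (source languages first, target language second) is an assumed reading of "transfer learning", and the amount of augmented data used for the spectrogram prediction models is left as a parameter because the abstract gives a figure (13 h) only for the vocoder.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Corpus:
    """A named speech corpus and its size in hours (hypothetical helper type)."""
    name: str
    hours: float


# Corpus sizes taken from the abstract.
ENGLISH = Corpus("english", 24.0)
JAPANESE = Corpus("japanese", 10.0)
TARGET = Corpus("target", 0.5)  # 30 min of target language data


def training_plan(approach: str, augmented_hours: float) -> List[List[Corpus]]:
    """Return ordered training stages; each stage lists corpora used together.

    Assumption: source-language data is used in a first stage and target
    data in a second stage. The abstract does not spell out the schedule.
    """
    augmented = Corpus("target_augmented", augmented_hours)
    if approach == "transfer":       # approach (1)
        return [[ENGLISH, JAPANESE], [TARGET]]
    if approach == "augmentation":   # approach (2)
        return [[TARGET, augmented]]
    if approach == "combined":       # approach (3)
        return [[ENGLISH, JAPANESE], [TARGET, augmented]]
    raise ValueError(f"unknown approach: {approach!r}")


def total_hours(plan: List[List[Corpus]]) -> float:
    """Sum the hours of every corpus across all stages."""
    return sum(c.hours for stage in plan for c in stage)


if __name__ == "__main__":
    # 13.0 is used here purely as an illustrative value for augmented_hours.
    for name in ("transfer", "augmentation", "combined"):
        plan = training_plan(name, augmented_hours=13.0)
        stages = [[c.name for c in stage] for stage in plan]
        print(name, stages, total_hours(plan))
```

In every plan the real target language data stays fixed at 0.5 h; only the source-language and augmented corpora differ between approaches.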

External Text Based Data Augmentation for Low-Resource Speech Recognition in the Constrained Condition of OpenASR21 Challenge

Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap

Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation

The NTU-AISG Text-to-speech System for Blizzard Challenge 2020

ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion

Data Augmentation For Children's Speech Recognition -- The "Ethiopian" System For The SLT 2021 Children Speech Recognition Challenge

Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation

Improving fairness for spoken language understanding in atypical speech with Text-to-Speech

Almost Unsupervised Text to Speech and Automatic Speech Recognition

Exploring Speech Enhancement for Low-resource Speech Synthesis

Learning from Multiple Noisy Augmented Data Sets for Better Cross-Lingual Spoken Language Understanding

An efficient text augmentation approach for contextualized Mandarin speech recognition

Transsion TSUP's speech recognition system for ASRU 2023 MADASR Challenge

USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder

Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora

Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation

TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS

The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge

Text Generation with Speech Synthesis for ASR Data Augmentation

The NPU-ASLP System for The ISCSLP 2022 Magichub Code-Swiching ASR Challenge

Creating Spoken Dialog Systems in Ultra-Low Resourced Settings