Abstract: With the development of deep learning, neural network (NN) based text-to-speech (TTS), which adopts deep neural networks as the model backbone for speech synthesis, has become the mainstream technology for TTS. Compared to earlier TTS systems based on concatenative synthesis and statistical parametric synthesis, NN based speech synthesis shows conspicuous advantages: it requires less human pre-processing and feature development, and it produces high-quality voices in terms of both intelligibility and naturalness. However, a robust NN based speech synthesis model typically requires a sizable set of high-quality data for training, which is expensive to collect, especially in low-resource scenarios. It is therefore worth investigating how to take advantage of lower-quality material such as automatic speech recognition (ASR) data, which is much easier to obtain than high-quality TTS material. In this paper, we propose a pre-training framework to improve the performance of low-resource speech synthesis. The idea is to extend the training material of the TTS model with an ASR based data augmentation method. Specifically, we first build a frame-wise phoneme classification network on the ASR dataset and use it to extract semi-supervised <linguistic features, audio> paired data from large-scale speech corpora. We then pre-train the NN based TTS acoustic model on these semi-supervised <linguistic features, audio> pairs. Finally, we fine-tune the model with the small amount of available paired data. Experimental results show that the proposed framework enables the TTS model to generate more intelligible and natural speech with the same amount of paired training data.
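
As an illustration of the extraction step described in the abstract, the following minimal sketch shows one way frame-wise phoneme classifier outputs could be collapsed into a phoneme sequence with per-phoneme frame counts, which could then be paired with the source audio. This is not the authors' implementation: the function name `frames_to_phoneme_segments`, the toy phoneme set, and the hand-written posteriors are all hypothetical, and the abstract does not state how the linguistic features are actually derived from the classifier.

```python
from itertools import groupby


def frames_to_phoneme_segments(frame_posteriors, phoneme_set):
    """Collapse frame-wise posteriors into (phoneme, n_frames) segments.

    frame_posteriors: list of per-frame probability lists, one entry per phoneme.
    phoneme_set: list of phoneme symbols, indexed like each posterior row.
    """
    # Pick the most likely phoneme index for every frame (argmax).
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_posteriors]
    # Merge consecutive frames with the same label and count their length.
    return [(phoneme_set[idx], sum(1 for _ in run)) for idx, run in groupby(best)]


# Toy example (assumed values): 5 frames over a 3-symbol phoneme set.
phoneme_set = ["sil", "h", "ai"]
posteriors = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.70, 0.20],
    [0.10, 0.20, 0.70],
    [0.05, 0.15, 0.80],
]
segments = frames_to_phoneme_segments(posteriors, phoneme_set)
print(segments)  # [('sil', 2), ('h', 1), ('ai', 2)]
```

In a pipeline of the kind the abstract outlines, segments like these would stand in for the linguistic side of each <linguistic features, audio> pair used for pre-training, with the small set of genuine paired data reserved for fine-tuning.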

You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation

Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems

Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis

Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation

Auditory-Based Data Augmentation for End-to-End Automatic Speech Recognition

Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap

Improving Speech Recognition Using GAN-Based Speech Synthesis and Contrastive Unspoken Text Selection

Pre-training Techniques for Improving Text-to-Speech Synthesis by Automatic Speech Recognition Based Data Enhancement

Speaker Augmentation for Low Resource Speech Recognition

Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation

TTS-by-TTS: TTS-Driven Data Augmentation for Fast and High-Quality Speech Synthesis

Improving Speech Recognition with Augmented Synthesized Data and Conditional Model Training

SCADA: Stochastic, Consistent and Adversarial Data Augmentation to Improve ASR

Effective Data Augmentation Methods for Neural Text-to-Speech Systems

Data Augmentation for End-to-end Code-switching Speech Recognition

ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion

Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation

USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder

Improving Low Resource Code-switched ASR using Augmented Code-switched TTS

Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis

Towards Selection of Text-to-speech Data to Augment ASR Training