Abstract: With the development of deep learning, neural network (NN) based text-to-speech (TTS), which adopts deep neural networks as the model backbone for speech synthesis, has become the mainstream TTS technology. Compared with earlier TTS systems based on concatenative synthesis and statistical parametric synthesis, NN based speech synthesis shows clear advantages: it requires less human pre-processing and feature development, and it produces high-quality speech in terms of both intelligibility and naturalness. However, a robust NN based speech synthesis model typically requires a sizable set of high-quality training data, which is expensive to collect, especially in low-resource scenarios. It is therefore worth investigating how to take advantage of lower-quality material such as automatic speech recognition (ASR) data, which is much easier to obtain than high-quality TTS material. In this paper, we propose a pre-training framework to improve the performance of low-resource speech synthesis. The idea is to extend the training material of the TTS model with an ASR based data augmentation method. Specifically, we first build a frame-wise phoneme classification network on the ASR dataset and use it to extract semi-supervised <linguistic features, audio> paired data from large-scale speech corpora. We then pre-train the NN based TTS acoustic model on these semi-supervised <linguistic features, audio> pairs. Finally, we fine-tune the model with the small amount of available paired data. Experimental results show that the proposed framework enables the TTS model to generate more intelligible and natural speech with the same amount of paired training data.
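One way to picture the extraction step is the following minimal Python sketch. It is not the paper's implementation: it assumes the frame-wise phoneme classifier outputs one posterior vector per audio frame, and that the linguistic features can be approximated as phoneme segments with frame counts. The phoneme inventory `PHONEMES` and the helper `frames_to_phoneme_durations` are hypothetical names introduced only for illustration.

```python
from itertools import groupby

# Hypothetical phoneme inventory, for illustration only.
PHONEMES = ["sil", "a", "b"]


def frames_to_phoneme_durations(posteriors, phonemes=PHONEMES):
    """Collapse frame-wise phoneme posteriors into (phoneme, n_frames) runs.

    Assumption: each row of `posteriors` is the classifier output for one frame.
    """
    # Pick the most likely phoneme index for each frame.
    frame_labels = [
        max(range(len(frame)), key=frame.__getitem__) for frame in posteriors
    ]
    # Merge consecutive identical labels into one segment with a duration.
    return [(phonemes[idx], len(list(run))) for idx, run in groupby(frame_labels)]


# Toy posteriors for five frames (made-up numbers, not experimental data).
posteriors = [
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
]
segments = frames_to_phoneme_durations(posteriors)
print(segments)
```

Under these assumptions, each untranscribed utterance yields a pseudo-labelled phoneme sequence that can be paired with its audio for pre-training, before fine-tuning on the small set of genuinely paired data.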

Efficient Utilization of Large Pre-Trained Models for Low Resource ASR

Unsupervised Pre-Training for Vietnamese Automatic Speech Recognition in the HYKIST Project

Exploring Effective Data Utilization for Low-Resource Speech Recognition

Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR

Optimizing Data Usage for Low-Resource Speech Recognition

Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

Universal Cross-Lingual Data Generation for Low Resource ASR

Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation

Exploiting foreign resources for DNN-based ASR

Pre-training Techniques for Improving Text-to-Speech Synthesis by Automatic Speech Recognition Based Data Enhancement

Development of Hybrid ASR Systems for Low Resource Medical Domain Conversational Telephone Speech

Almost Unsupervised Text to Speech and Automatic Speech Recognition

A General Procedure for Improving Language Models in Low-Resource Speech Recognition

Towards Building ASR Systems for the Next Billion Users

Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR

Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

Pre-training for low resource speech-to-intent applications

Towards scalable efficient on-device ASR with transfer learning

Leveraging supplementary text data to kick-start automatic speech recognition system development with limited transcriptions

Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models