Abstract: With the development of deep learning, neural network (NN) based text-to-speech (TTS), which adopts deep neural networks as the model backbone for speech synthesis, has become the mainstream technology for TTS. Compared to previous TTS systems based on concatenative synthesis and statistical parametric synthesis, NN based speech synthesis shows conspicuous advantages: it requires less human pre-processing and feature development, and it produces high-quality voices in terms of both intelligibility and naturalness. However, a robust NN based speech synthesis model typically requires a sizable set of high-quality data for training, which is expensive to collect, especially in low-resource scenarios. It is therefore worth investigating how to take advantage of low-quality material such as automatic speech recognition (ASR) data, which is much easier to obtain than high-quality TTS material. In this paper, we propose a pre-training framework to improve the performance of low-resource speech synthesis. The idea is to extend the training material of the TTS model with an ASR based data augmentation method. Specifically, we first build a frame-wise phoneme classification network on the ASR dataset and extract semi-supervised <linguistic features, audio> paired data from large-scale speech corpora. We then pre-train the NN based TTS acoustic model on these semi-supervised pairs. Finally, we fine-tune the model with a small amount of available paired data. Experimental results show that our proposed framework enables the TTS model to generate more intelligible and natural speech with the same amount of paired training data.
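
As a rough illustration of the pair-extraction step described in the abstract, the sketch below turns frame-wise phoneme posteriors into a pseudo-labelled <linguistic features, audio> pair. It is a minimal sketch under stated assumptions, not the authors' implementation: the phoneme inventory `PHONEMES`, the helper names `frame_labels`, `collapse`, and `make_pair`, and the toy posterior values are all hypothetical, and the classifier is replaced by hand-written numbers.

```python
from itertools import groupby

# Hypothetical phoneme inventory; the abstract does not specify the real one.
PHONEMES = ["sil", "a", "b", "k"]


def frame_labels(posteriors):
    """Return the index of the most likely phoneme for every frame (argmax per row)."""
    return [max(range(len(row)), key=row.__getitem__) for row in posteriors]


def collapse(labels):
    """Merge runs of identical frame labels into (phoneme, duration_in_frames) tuples."""
    return [(PHONEMES[idx], len(list(run))) for idx, run in groupby(labels)]


def make_pair(posteriors, acoustic_frames):
    """Build one pseudo-labelled <linguistic features, audio> pair (assumed layout)."""
    if len(posteriors) != len(acoustic_frames):
        raise ValueError("classifier output and acoustic frames must align frame by frame")
    labels = frame_labels(posteriors)
    return {
        "frame_labels": labels,          # frame-level linguistic features
        "linguistic": collapse(labels),  # phoneme sequence with durations
        "audio": acoustic_frames,        # acoustic features of the same utterance
    }


# Toy stand-in for the output of a frame-wise phoneme classifier (5 frames).
posteriors = [
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.20, 0.60, 0.10],
    [0.10, 0.10, 0.70, 0.10],
    [0.05, 0.80, 0.10, 0.05],
    [0.10, 0.70, 0.10, 0.10],
]
acoustic_frames = [[0.0], [0.1], [0.2], [0.3], [0.4]]  # placeholder acoustic features

pair = make_pair(posteriors, acoustic_frames)
print(pair["linguistic"])
```

Pairs built this way would serve as the pre-training set for the acoustic model, with the small amount of genuinely paired data reserved for fine-tuning, which is the order of steps the abstract describes.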

A Multi-task Framework of Speaker Recognition with TTS Data Augmentation

AudioVSR: Enhancing Video Speech Recognition with Audio Data

VarASV: Enabling Pitch-variable Automatic Speaker Verification Via Multi-task Learning

Speaker Augmentation for Low Resource Speech Recognition

A Comprehensive Investigation on Speaker Augmentation for Speaker Recognition

Improving Speech Recognition with Augmented Synthesized Data and Conditional Model Training

TDASS: Target Domain Adaptation Speech Synthesis Framework for Multi-speaker Low-Resource TTS

Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation

Unit selection synthesis based data augmentation for fixed phrase speaker verification

Multi-task learning of deep neural networks for joint automatic speaker verification and spoofing detection

A Light CNN with Split Batch Normalization for Spoofed Speech Detection Using Data Augmentation

Multimodal Sensor-Input Architecture with Deep Learning for Audio-Visual Speech Recognition in Wild

Pre-training Techniques for Improving Text-to-Speech Synthesis by Automatic Speech Recognition Based Data Enhancement

Speech Recognition with Augmented Synthesized Speech

Adaptive data augmentation for mandarin automatic speech recognition

Anti-Spoofing Speaker Verification System with Multi-Feature Integration and Multi-Task Learning

Training Data Augmentation for Dysarthric Automatic Speech Recognition by Text-to-Dysarthric-Speech Synthesis

Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap

Speaker-Aware Anti-Spoofing

Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation

Multi-Task Learning Improves Synthetic Speech Detection