Abstract: In recent years, automatic speech recognition (ASR) models have greatly improved transcription performance, both in clean, low-noise acoustic conditions and in reverberant environments. However, all these systems rely on the availability of hundreds of hours of labelled training data in specific acoustic conditions. When such a training dataset is not available, the performance of the system is heavily impacted. For example, this happens when a specific acoustic environment or a particular population of speakers is under-represented in the training dataset. Specifically, in this paper we investigate the effect of accented speech data on an off-the-shelf ASR system. Furthermore, we suggest a strategy based on zero-shot text-to-speech to augment the accented speech corpora. We show that this augmentation method is able to mitigate the loss in performance of the ASR system on accented data, with up to 5% word error rate reduction (WERR). In conclusion, we demonstrate that by combining a modest fraction of real data with synthetically generated data, the ASR system exhibits superior performance compared to a model trained exclusively on authentic accented speech, with up to 14% WERR.

What problem does this paper attempt to address?

The problem this paper addresses is the performance degradation of automatic speech recognition (ASR) systems on low-resource accented speech. Specifically, when the training data under-represents a particular accent or acoustic environment, the performance of ASR systems declines significantly. To mitigate this, the authors propose a method based on zero-shot text-to-speech (ZS-TTS) to augment accented speech corpora.

### Main problems

1. **Low-resource and under-represented accent data**: Existing ASR systems usually rely on a large amount of labelled training data, and for some accents or acoustic environments this data may be insufficient.
2. **Insufficient robustness of ASR systems to non-native accents**: When ASR systems are trained mainly on native-accented data, they perform poorly on non-native-accented speech.
3. **How to use synthetic speech data effectively**: Synthetic speech can be generated to augment the training set, with the goal of improving ASR performance on accented speech.

### Solutions

The authors propose an augmentation method based on ZS-TTS that supplements real speech data with synthetic speech data, thereby improving the performance of ASR systems on accented speech. The specific steps are as follows:

- **Generate synthetic speech data using ZS-TTS**: Generate synthetic speech data with different accents through the ZS-TTS system.
- **Mix real and synthetic data for training**: Mix the generated synthetic speech data with real speech data for training and fine-tuning the ASR model.
- **Evaluate performance improvement**: Verify the effectiveness of this method through experiments, especially on Indian-accented English speech.

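The mixing step in the list above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Utterance` record, the `mix_training_set` helper, and the `real_fraction` parameter are hypothetical names introduced here, and the sketch assumes the synthetic utterances have already been generated by some ZS-TTS system.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Hypothetical manifest entry: one audio file plus its transcript."""
    audio_path: str
    transcript: str
    is_synthetic: bool


def mix_training_set(real, synthetic, real_fraction, total, seed=0):
    """Sample `total` utterances so that about `real_fraction` of them are real.

    Assumption: sampling is uniform and without replacement; the paper may
    select or weight data differently.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be between 0 and 1")
    n_real = round(total * real_fraction)
    n_synthetic = total - n_real
    if n_real > len(real) or n_synthetic > len(synthetic):
        raise ValueError("not enough utterances to build the requested mix")

    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = rng.sample(real, n_real) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mixed)  # interleave real and synthetic examples
    return mixed
```

Keeping the real-data share as an explicit parameter makes it straightforward to compare several real-to-synthetic ratios under otherwise identical fine-tuning settings.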
### Experimental results

The experiments show that adding synthetic speech generated by ZS-TTS improves the performance of the ASR system on accented data, with up to 5% word error rate reduction (WERR); the summary highlights English speech with Indian accents in particular. Moreover, a model trained on a modest fraction of real data combined with synthetic data outperforms a model trained exclusively on real accented speech, by up to 14% WERR.

In conclusion, the paper mitigates the performance degradation of ASR systems on low-resource accented speech by augmenting the training data with synthetic speech generated by ZS-TTS.
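For readers unfamiliar with the metric, the following is a minimal sketch of how word error rate (WER) and WERR can be computed. It assumes WERR is the relative reduction in WER against a baseline, which is a common convention but is not spelled out in the abstract; the function names are illustrative, and no text normalization is applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Dynamic programming over one row at a time (Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_werr(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction (assumed definition of WERR)."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

Under this definition, moving from a baseline WER of 20% to 19% corresponds to a WERR of 5%, which is a one-point absolute improvement.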

Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora
