Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora

Francesco Nespoli, Daniel Barreda, Patrick A. Naylor
2024-09-17
Abstract: In recent years, automatic speech recognition (ASR) models greatly improved transcription performance both in clean, low noise, acoustic conditions and in reverberant environments. However, all these systems rely on the availability of hundreds of hours of labelled training data in specific acoustic conditions. When such a training dataset is not available, the performance of the system is heavily impacted. For example, this happens when a specific acoustic environment or a particular population of speakers is under-represented in the training dataset. Specifically, in this paper we investigate the effect of accented speech data on an off-the-shelf ASR system. Furthermore, we suggest a strategy based on zero-shot text-to-speech to augment the accented speech corpora. We show that this augmentation method is able to mitigate the loss in performance of the ASR system on accented data up to 5% word error rate reduction (WERR). In conclusion, we demonstrate that by incorporating a modest fraction of real with synthetically generated data, the ASR system exhibits superior performance compared to a model trained exclusively on authentic accented speech with up to 14% WERR.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
The paper addresses the performance degradation of automatic speech recognition (ASR) systems on low-resource accented speech. When the training data under-represents a specific accent or acoustic environment, ASR performance declines markedly. To address this, the authors propose augmenting accented speech datasets with zero-shot text-to-speech (ZS-TTS).

### Main problems

1. **Low-resource and under-represented accent data**: Existing ASR systems usually rely on large amounts of labeled training data, which may be insufficient for specific accents or acoustic environments.
2. **Insufficient robustness of ASR systems to non-native accents**: ASR systems trained mainly on native-accented data perform poorly on non-native-accented speech.
3. **How to effectively use synthetic speech data**: How to generate synthetic speech that augments the training set and thereby improves ASR performance on accented speech.

### Solutions

The authors propose a ZS-TTS-based augmentation method that supplements real speech data with synthetic speech in order to improve ASR performance on accented speech. The steps are as follows:

- **Generate synthetic speech data using ZS-TTS**: Generate synthetic speech with different accents using the ZS-TTS system.
- **Mix real and synthetic data for training**: Combine the generated synthetic speech with real speech to train and fine-tune the ASR model.
- **Evaluate performance improvement**: Verify the effectiveness of the method experimentally, in particular on Indian-accented English speech.

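
The step of mixing real and synthetic data can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Utterance` record, the `build_training_set` function, and the `real_fraction` parameter are hypothetical names, and this summary does not state the actual mixing ratios or how the ZS-TTS audio is produced.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Utterance:
    """One training example (hypothetical record, not from the paper)."""
    audio_path: str
    transcript: str
    is_synthetic: bool


def build_training_set(
    real: Sequence[Utterance],
    synthetic: Sequence[Utterance],
    real_fraction: float,
    seed: int = 0,
) -> List[Utterance]:
    """Keep a random fraction of the real utterances, add all synthetic
    utterances, and shuffle the result.

    real_fraction is an assumed knob: 0.0 means synthetic-only,
    1.0 means all real data plus all synthetic data.
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be in [0, 1]")
    rng = random.Random(seed)  # seeded so the split is reproducible
    n_real = round(len(real) * real_fraction)
    kept_real = rng.sample(list(real), n_real)
    mixed = kept_real + list(synthetic)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    real = [Utterance(f"real_{i}.wav", "placeholder text", False) for i in range(10)]
    tts = [Utterance(f"tts_{i}.wav", "placeholder text", True) for i in range(5)]
    mixed = build_training_set(real, tts, real_fraction=0.2)
    print(len(mixed), sum(u.is_synthetic for u in mixed))
```

Seeding the random generator keeps the selected real subset fixed across runs, which makes comparisons between different values of `real_fraction` easier to reproduce.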
### Experimental results

The experiments show that augmenting the training data with ZS-TTS-generated speech mitigates the performance loss of the ASR system on accented data, with up to 5% word error rate reduction (WERR) on Indian-accented English. Moreover, when a modest fraction of real data is combined with synthetically generated data, the ASR model outperforms one trained exclusively on real accented speech, with up to 14% WERR.

In conclusion, the paper shows that synthetic speech generated by ZS-TTS alleviates the performance degradation of ASR systems on low-resource accented speech.
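
For readers unfamiliar with the metrics, the sketch below computes word error rate (WER) as the word-level edit distance divided by the number of reference words, and a word error rate reduction. The summary does not define WERR, so the relative form used here, (baseline WER - new WER) / baseline WER, is an assumption based on common usage; the function names are hypothetical and the numbers in the usage example are illustrative, not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed with a word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_werr(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction (assumed definition, see the text above)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.3f}")
    # Illustrative numbers only, not taken from the paper.
    print(f"WERR: {relative_werr(0.20, 0.17):.2%}")
```

If the paper reports absolute rather than relative reduction, `relative_werr` would be replaced by a plain difference of the two WER values.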