Abstract: The performance of automatic speech recognition (ASR) systems has advanced substantially in recent years, particularly for languages for which a large amount of transcribed speech is available. For low-resource languages, such as minority languages, regional languages or dialects, ASR performance generally remains much lower. In this study, we investigate whether data augmentation techniques can improve low-resource ASR performance, focusing on four typologically diverse minority languages or language variants (West Germanic: Gronings, West-Frisian; Malayo-Polynesian: Besemah, Nasal). For all four languages, we examine the use of self-training, where an ASR system trained on the available human-transcribed data is used to generate transcriptions of untranscribed speech, which are then combined with the original data to train a new ASR system. For Gronings, for which a pre-existing text-to-speech (TTS) system was available, we also examine the use of TTS to generate ASR training data from text-only sources. We find that self-training consistently yields improved performance (a relative reduction in word error rate (WER) of up to 20.5% compared to an ASR system trained on 24 minutes of manually transcribed speech). The performance gain from TTS augmentation for Gronings was even stronger (a relative WER reduction of up to 25.5% compared to a system trained on 24 minutes of manually transcribed speech). In sum, our results show the benefit of using self-training or, where possible, TTS-generated data as an efficient way to overcome limited data availability and improve ASR performance for resource-scarce languages.
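The self-training procedure and the relative WER metric described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `train` and `transcribe` callables are hypothetical placeholders for whatever ASR toolkit is used, and the names `self_train`, `word_error_rate` and `relative_wer_reduction` are assumptions introduced here.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Model = TypeVar("Model")
Utterance = Tuple[str, str]  # (audio identifier, transcript)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which the WER dropped relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def self_train(
    labeled: Sequence[Utterance],
    unlabeled_audio: Sequence[str],
    train: Callable[[List[Utterance]], Model],      # hypothetical trainer
    transcribe: Callable[[Model, str], str],        # hypothetical decoder
) -> Model:
    """One round of self-training with pseudo-labels."""
    # 1. Train a first system on the human-transcribed data only.
    teacher = train(list(labeled))
    # 2. Use it to transcribe the untranscribed audio.
    pseudo_labeled = [(audio, transcribe(teacher, audio)) for audio in unlabeled_audio]
    # 3. Train a new system on the original plus machine-transcribed data.
    return train(list(labeled) + pseudo_labeled)
```

As a purely illustrative calculation (these WER values are not taken from the study), a baseline WER of 0.40 that falls to 0.318 corresponds to `relative_wer_reduction(0.40, 0.318)`, i.e. a relative reduction of about 20.5%.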

Speech Technology for Everyone: Automatic Speech Recognition for Non-Native English with Transfer Learning

Improving Automatic Speech Recognition for Non-Native English with Transfer Learning and Language Model Decoding

Improving Non-Autoregressive End-to-End Speech Recognition with Pre-Trained Acoustic and Language Models

Dialect Adaptation and Data Augmentation for Low-Resource ASR: TalTech Systems for the MADASR 2023 Challenge

Almost Unsupervised Text to Speech and Automatic Speech Recognition

Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

Predicting positive transfer for improved low-resource speech recognition using acoustic pseudo-tokens

Learning not to Discriminate: Task Agnostic Learning for Improving Monolingual and Code-switched Speech Recognition

Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation

A multilingual training strategy for low resource Text to Speech

Multilingual Speech Translation with Efficient Finetuning of Pretrained Models

Improving Massively Multilingual ASR With Auxiliary CTC Objectives

Improving Speech Recognition for African American English With Audio Classification

Transfer learning of language-independent end-to-end ASR with language model fusion

Transfer the linguistic representations from TTS to accent conversion with non-parallel data

Cantonese Automatic Speech Recognition Using Transfer Learning from Mandarin

Error-preserving Automatic Speech Recognition of Young English Learners' Language

Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation

Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap

Improving child speech recognition with augmented child-like speech

Intermediate Fine-Tuning Using Imperfect Synthetic Speech for Improving Electrolaryngeal Speech Recognition