Abstract: The performance of automatic speech recognition (ASR) systems has advanced substantially in recent years, particularly for languages for which a large amount of transcribed speech is available. Unfortunately, for low-resource languages, such as minority languages, regional languages or dialects, ASR performance generally remains much lower. In this study, we investigate whether data augmentation techniques can improve low-resource ASR performance, focusing on four typologically diverse minority languages or language variants (West Germanic: Gronings, West-Frisian; Malayo-Polynesian: Besemah, Nasal). For all four languages, we examine the use of self-training, where an ASR system trained with the available human-transcribed data is used to generate transcriptions, which are then combined with the original data to train a new ASR system. For Gronings, for which a pre-existing text-to-speech (TTS) system was available, we also examine the use of TTS to generate ASR training data from text-only sources. We find that self-training consistently improves performance (a relative WER reduction of up to 20.5% compared to an ASR system trained on 24 minutes of manually transcribed speech). The performance gain from TTS augmentation for Gronings is even larger (a relative WER reduction of up to 25.5% compared to a system trained on 24 minutes of manually transcribed speech). In sum, our results show that self-training or (where possible) TTS-generated data offers an efficient way to overcome limited data availability and improve ASR performance for resource-scarce languages.
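The self-training loop and the relative WER reduction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `wer`, `relative_wer_reduction`, `self_train`, and the `train` callable are hypothetical and introduced here only for illustration, and the sketch assumes that `train` takes a list of (audio, transcript) pairs and returns a function mapping audio to a transcript.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]            # (audio identifier, transcript); illustrative type
Model = Callable[[str], str]      # maps an audio identifier to a transcript


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction in percent: how much of the baseline WER was removed."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


def self_train(train: Callable[[List[Pair]], Model],
               labeled: List[Pair],
               unlabeled_audio: List[str]) -> Model:
    """One round of self-training (hypothetical helper, assumed interfaces)."""
    teacher = train(labeled)                                   # model from human transcripts
    pseudo = [(audio, teacher(audio)) for audio in unlabeled_audio]  # machine transcripts
    return train(labeled + pseudo)                             # retrain on the combined data
```

Under these assumptions, a baseline WER of 0.50 that drops to 0.40 after self-training corresponds to a relative reduction of about 20%; these numbers are made up for the example and are not results from the study.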

A case study on using speech-to-translation alignments for language documentation

SpeechAlign: a Framework for Speech Translation Alignment Evaluation

Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation

AlloST: Low-resource Speech Translation without Source Transcription

A multilingual training strategy for low resource Text to Speech

Towards speech-to-text translation without speech recognition

Leveraging supplementary text data to kick-start automatic speech recognition system development with limited transcriptions

Combining Many Alignments for Speech to Speech Translation

Optimizing Data Usage for Low-Resource Speech Recognition

Predicting positive transfer for improved low-resource speech recognition using acoustic pseudo-tokens

Exploring Effective Data Utilization for Low-Resource Speech Recognition

Noisy Parallel Data Alignment

Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation

Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases

Improving Joint Speech-Text Representations Without Alignment

Performance Improvements of Probabilistic Transcript-adapted ASR with Recurrent Neural Network and Language-specific Constraints

AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation

Aligning Speech to Languages to Enhance Code-switching Speech Recognition

That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages

Feasibility of Post-Editing Speech Transcriptions with a Mismatched Crowd

Improving Speech-to-Speech Translation Through Unlabeled Text