Abstract: The performance of automatic speech recognition (ASR) systems has advanced substantially in recent years, particularly for languages for which a large amount of transcribed speech is available. For low-resource languages, such as minority languages, regional languages or dialects, ASR performance generally remains much lower. In this study, we investigate whether data augmentation techniques can help improve low-resource ASR performance, focusing on four typologically diverse minority languages or language variants (West Germanic: Gronings, West-Frisian; Malayo-Polynesian: Besemah, Nasal). For all four languages, we examine the use of self-training, where an ASR system trained on the available human-transcribed data is used to generate transcriptions, which are then combined with the original data to train a new ASR system. For Gronings, for which a pre-existing text-to-speech (TTS) system was available, we also examine the use of TTS to generate ASR training data from text-only sources. We find that a self-training approach consistently yields improved performance (a relative word error rate (WER) reduction of up to 20.5% compared to an ASR system trained on 24 minutes of manually transcribed speech). The performance gain from TTS augmentation for Gronings was even stronger (a relative WER reduction of up to 25.5% compared to a system based on 24 minutes of manually transcribed speech). In sum, our results show the benefit of using self-training or (if possible) TTS-generated data as an efficient way to overcome limited data availability and improve ASR performance for resource-scarce languages.
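The self-training loop and the relative WER figures described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `self_train`, `train_fn` and `transcribe_fn` are hypothetical names introduced here, the training and decoding steps are left as caller-supplied functions, and WER is computed with a plain word-level edit distance.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (audio identifier, transcription)


def self_train(
    labeled: Sequence[Example],
    unlabeled_audio: Sequence[str],
    train_fn: Callable[[List[Example]], object],   # hypothetical: trains an ASR model
    transcribe_fn: Callable[[object, str], str],   # hypothetical: decodes one utterance
) -> object:
    """One round of self-training, assuming caller-supplied train/decode steps."""
    # 1. Train a first system on the human-transcribed data only.
    teacher = train_fn(list(labeled))
    # 2. Use it to transcribe the untranscribed audio.
    pseudo = [(audio, transcribe_fn(teacher, audio)) for audio in unlabeled_audio]
    # 3. Train a new system on the original plus the machine-transcribed data.
    return train_fn(list(labeled) + pseudo)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer
```

Under this definition, a relative reduction of 20.5% means the new system's WER is 79.5% of the baseline WER, whatever the absolute values are.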

Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention

Automatic lexical stress and pitch accent detection for L2 English speech using multi-distribution deep neural networks

Automatic Lexical Stress Detection for English Learning

End-to-end Mispronunciation Detection with Simulated Error Distance

Stress Accent Detection in an English Learning System

Automatic Stress Exaggeration By Prosody Modification To Assist Language Learners Perceive Sentence Stress

Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis

Evaluating the Impact of Discriminative and Generative E2E Speech Enhancement Models on Syllable Stress Preservation

Detecting Syllable-Level Pronunciation Stress with A Self-Attention Model

Automated detection of pronunciation errors in non-native English speech employing deep learning

English Sentence Stress Detection System Based on HMM Framework

Weakly-supervised word-level pronunciation error detection in non-native English speech

Automatic lexical stress detection using acoustic features for computer-assisted language learning

Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation

Synthetic Cross-accent Data Augmentation for Automatic Speech Recognition

A Full Text-Dependent End to End Mispronunciation Detection and Diagnosis with Easy Data Augmentation Techniques

Towards Robust Mispronunciation Detection and Diagnosis for L2 English Learners with Accent-Modulating Methods

Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation

Data augmentation using prosody and false starts to recognize non-native children's speech

Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora

Generative Deep Learning and Signal Processing for Data Augmentation of Cardiac Auscultation Signals: Improving Model Robustness Using Synthetic Audio