Abstract: Modern automatic speech recognition (ASR) systems require large amounts of data to train the acoustic model, especially with state-of-the-art deep neural network (DNN) architectures. Unfortunately, most of the world's languages have accumulated very limited data resources, which limits the application of ASR technologies to these languages. The state-of-the-art approach to this problem is transfer learning, by which DNNs trained on data from a resource-rich language are reused by low-resource language systems, either as a feature extractor or as an initial model. This approach, however, still requires several hours of speech, which remains unaffordable for many languages. In this study, we present a novel Map and Relabel (MaR) approach that can train ASR systems for new languages with only a few hundred labelled utterances. This approach combines transfer learning and semi-supervised learning in a boosting manner: it first trains a simple monophone DNN on the limited training data, employing the popular transfer learning approach (Map phase); this model is then used to produce pseudo phone labels for a large amount of untranscribed speech (Relabel phase). These pseudo-labelled data are then used to train a full-fledged tri-phone system. Experiments on Uyghur, a major minority language in western China, demonstrate that the MaR approach is successful: it can train a reasonably good Uyghur ASR system with only 500 utterances. These encouraging results indicate that it is possible to quickly construct a reasonable ASR system for any language, with the only effort required being the labelling of several hundred utterances.
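
The two-phase flow described in the abstract can be illustrated with a toy sketch. This is not the authors' system: the frozen feature extractor, the nearest-centroid classifiers, and the synthetic 2-D data are all assumptions chosen to keep the example small, standing in for the transferred DNN, the monophone and tri-phone models, and real speech respectively. All function names are hypothetical.

```python
import numpy as np

def frozen_extractor(x):
    """Stand-in for a DNN trained on a resource-rich language (assumption):
    here it is just a fixed rescaling that is never updated."""
    return x / 5.0

def fit_centroids(features, labels, num_classes):
    """Fit a nearest-centroid classifier: one mean vector per class.
    Assumes every class appears at least once in `labels`."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])

def predict(centroids, features):
    """Assign each feature row to the class of its closest centroid."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def map_and_relabel(x_labelled, y_labelled, x_unlabelled, num_classes):
    # Map phase: train a simple seed model on the few labelled examples,
    # on top of the reused (frozen) feature extractor.
    seed_model = fit_centroids(frozen_extractor(x_labelled), y_labelled, num_classes)
    # Relabel phase: produce pseudo labels for the large unlabelled pool.
    unlabelled_features = frozen_extractor(x_unlabelled)
    pseudo_labels = predict(seed_model, unlabelled_features)
    # Final training: fit a new model on the pseudo-labelled pool.
    final_model = fit_centroids(unlabelled_features, pseudo_labels, num_classes)
    return seed_model, pseudo_labels, final_model

# Synthetic data (assumption): three well-separated 2-D clusters.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])

def sample(n_per_class):
    x = np.concatenate([c + 0.5 * rng.standard_normal((n_per_class, 2)) for c in centers])
    y = np.repeat(np.arange(len(centers)), n_per_class)
    return x, y

x_lab, y_lab = sample(2)          # a handful of labelled examples
x_unlab, true_labels = sample(200)  # large pool; labels kept only for evaluation

seed_model, pseudo_labels, final_model = map_and_relabel(x_lab, y_lab, x_unlab, 3)
print("pseudo-label accuracy:", (pseudo_labels == true_labels).mean())
```

In this sketch the final model sees only pseudo-labelled data, mirroring the abstract's description; the quality of the final model therefore depends directly on how accurate the seed model's pseudo labels are.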

Indonesian Automatic Speech Recognition with XLSR-53

XLS-R Deep Learning Model for Multilingual ASR on Low-Resource Languages: Indonesian, Javanese, and Sundanese

Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities

Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models

Improved low-resource Somali speech recognition by semi-supervised acoustic and language model training

Anatomy of Industrial Scale Multilingual ASR

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation

Crossing language identification: Multilingual ASR framework based on semantic dataset creation & Wav2Vec 2.0

Semi-supervised acoustic and language model training for English-isiZulu code-switched speech recognition

Transsion TSUP's speech recognition system for ASRU 2023 MADASR Challenge

Towards Building ASR Systems for the Next Billion Users

Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

Cross-Lingual Machine Speech Chain for Javanese, Sundanese, Balinese, and Bataks Speech Recognition and Synthesis

Map and Relabel: Towards Almost-Zero Resource Speech Recognition

Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking

ARoBERT: An ASR Robust Pre-Trained Language Model for Spoken Language Understanding

Leveraging supplementary text data to kick-start automatic speech recognition system development with limited transcriptions

End-to-end Indonesian speech recognition with convolutional and gated recurrent units