Abstract: Modern automatic speech recognition (ASR) systems require large amounts of data to train the acoustic model, especially with state-of-the-art deep neural network (DNN) architectures. Unfortunately, most of the world's languages have accumulated very limited data resources, which restricts the application of ASR technologies to these languages. The state-of-the-art approach to this problem is transfer learning, by which DNNs trained on data from a rich-resource language are reused by low-resource language systems, either as a feature extractor or as an initial model. This approach, however, still requires several hours of speech, which is not affordable for many languages. In this study, we present a novel Map and Relabel (MaR) approach that can train ASR systems for new languages with only a few hundred labelled utterances. This approach combines transfer learning and semi-supervised learning in a boosting manner: it first trains a simple monophone DNN on the limited training data, employing the popular transfer learning approach (Map phase); this model is then used to produce pseudo phone labels for a large amount of untranscribed speech (Relabel phase). These pseudo-labelled data are then used to train a full-fledged triphone system. Experiments on Uyghur, a major minority language in western China, demonstrate that the MaR approach is effective: it can train a reasonably good Uyghur ASR system with only 500 utterances. These encouraging results indicate that it is possible to quickly construct a reasonable ASR system for any language, and that the only effort required is labelling several hundred utterances.
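The two-stage flow described above (train a small seed model on a few labelled examples, use it to pseudo-label untranscribed data, then train a second model on the pseudo-labels) can be illustrated with a toy sketch. This is not the authors' implementation: a nearest-centroid classifier stands in for both the monophone DNN and the triphone system, feature vectors are plain lists of floats, the transfer-learning part of the Map phase is not modelled, and all function names (`train_centroids`, `predict`, `map_and_relabel`) are hypothetical.

```python
def train_centroids(frames, labels):
    """Fit one mean vector per label (toy stand-in for an acoustic model)."""
    by_label = {}
    for frame, label in zip(frames, labels):
        by_label.setdefault(label, []).append(frame)
    # Average each feature dimension over the frames of one label.
    return {
        label: [sum(col) / len(col) for col in zip(*group)]
        for label, group in by_label.items()
    }


def predict(centroids, frame):
    """Return the label whose centroid is closest to the frame."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(frame, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def map_and_relabel(labelled_frames, labels, unlabelled_frames):
    """Toy two-stage training loop in the spirit of the abstract.

    Assumption: the seed model is trained from scratch here; the paper's
    Map phase additionally reuses a DNN from a rich-resource language.
    """
    # Stage 1: seed model from the small labelled set.
    seed_model = train_centroids(labelled_frames, labels)
    # Stage 2: pseudo-label the untranscribed data with the seed model.
    pseudo_labels = [predict(seed_model, f) for f in unlabelled_frames]
    # Stage 3: train the final model on the pseudo-labelled data only.
    final_model = train_centroids(unlabelled_frames, pseudo_labels)
    return seed_model, pseudo_labels, final_model


# Tiny illustrative data set: two "phones" in a 2-D feature space.
labelled = [[0.0, 0.0], [0.2, -0.1], [5.0, 5.0], [4.9, 5.2]]
phone_labels = ["a", "a", "b", "b"]
untranscribed = [[0.3, 0.1], [-0.2, 0.4], [5.3, 4.8], [4.6, 5.1]]

seed, pseudo, final = map_and_relabel(labelled, phone_labels, untranscribed)
```

In this sketch the final model never sees the original labels directly; it depends entirely on the quality of the pseudo-labels, which is why the seed model in the abstract is kept simple (monophone) when labelled data is scarce.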
