Abstract: There exist three approaches for multilingual and crosslingual automatic speech recognition (MCL-ASR): supervised pre-training with phonetic or graphemic transcription, and self-supervised pre-training. We find that pre-training with phonetic supervision has so far been underappreciated for MCL-ASR, although conceptually it is more advantageous for information sharing between different languages. This paper explores pre-training with weakly phonetic supervision towards data-efficient MCL-ASR, an approach called Whistle. We relax the requirement of gold-standard human-validated phonetic transcripts and obtain International Phonetic Alphabet (IPA) based transcription by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. We construct a common experimental setup based on the CommonVoice dataset, called CV-Lang10, with 10 seen languages and 2 unseen languages. A set of experiments is conducted on CV-Lang10 to compare, as fairly as possible, the three approaches under the common setup for MCL-ASR. Experiments demonstrate the advantages of phoneme-based models (Whistle) for MCL-ASR in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency. It is found that when training data is more limited, phoneme supervision can achieve better results than subword supervision and self-supervision, thereby providing higher data-efficiency. To support reproducibility and promote future research along this direction, we will release the code, models and data for the whole pipeline of Whistle at https://github.com/thu-spmi/CAT upon publication.

MLP-HMM Two-Stage Unsupervised Training for Low-Resource Languages on Conversational Telephone Speech Recognition

Articulatory Feature Based Multilingual MLPs for Low-Resource Speech Recognition

Cross-Lingual and Ensemble MLPs Strategies for Low-Resource Speech Recognition

Strategies for using MLP based features with limited target-language training data

Multi-Stream Posterior Features and Combining Subspace GMMs for Low Resource LVCSR

Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

Learning Cross-Lingual Information with Multilingual BLSTM for Speech Synthesis of Low-Resource Languages

Improve low-resource non-native mispronunciation detection with native speech by articulatory-based tandem feature

Low Resource Malay Dialect Automatic Speech Recognition Modeling Using Transfer Learning from a Standard Malay Model

A Semi-Supervised Complementary Joint Training Approach for Low-Resource Speech Recognition

Mandarin-English bilingual phone modeling and combining MPE based Discriminative training for cross-language speech recognition

Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training

Phone modeling and combining discriminative training for Mandarin-English bilingual speech recognition

Convolutional maxout neural networks for low-resource speech recognition

Towards Unsupervised Speech Recognition Without Pronunciation Models

Learning Phone Recognition from Unpaired Audio and Phone Sequences Based on Generative Adversarial Network

From Semi-supervised to Almost-unsupervised Speech Recognition with Very-low Resource by Jointly Learning Phonetic Structures from Audio and Text Embeddings

Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision

Completely Unsupervised Phoneme Recognition By A Generative Adversarial Network Harmonized With Iteratively Refined Hidden Markov Models

Hidden Markov Acoustic Modeling with Bootstrap and Restructuring for Low-Resourced Languages

Map and Relabel: Towards Almost-Zero Resource Speech Recognition