Abstract: There are three approaches to multilingual and crosslingual automatic speech recognition (MCL-ASR): supervised pre-training with phonetic transcription, supervised pre-training with graphemic transcription, and self-supervised pre-training. We find that pre-training with phonetic supervision has so far been underappreciated for MCL-ASR, although conceptually it is more advantageous for sharing information between languages. This paper explores pre-training with weakly phonetic supervision towards data-efficient MCL-ASR, an approach we call Whistle. We relax the requirement of gold-standard human-validated phonetic transcripts and obtain International Phonetic Alphabet (IPA) based transcriptions by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. We construct a common experimental setup based on the CommonVoice dataset, called CV-Lang10, with 10 seen languages and 2 unseen languages. A set of experiments is conducted on CV-Lang10 to compare, as fairly as possible, the three approaches under the common setup for MCL-ASR. Experiments demonstrate the advantages of phoneme-based models (Whistle) for MCL-ASR in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency. It is found that when training data is more limited, phoneme supervision achieves better results than subword supervision and self-supervision, thereby providing higher data efficiency. To support reproducibility and promote future research along this direction, we will release the code, models and data for the whole pipeline of Whistle at https://github.com/thu-spmi/CAT upon publication.

Lightly supervised acoustic model training for Mandarin continuous speech recognition

Study on Framework for Chinese Pronunciation Variation Modeling

Context Dependent Syllable Acoustic Model For Continuous Chinese Speech Recognition

Advances in Cantonese Speech Recognition: A Language-Specific Pretraining Model and RNN-T Loss

Effective Acoustic Modeling for Pronunciation Quality Scoring of Strongly Accented Mandarin Speech

Mandarin Continuous Digit Speech Recognition System

Research on Context-Dependent Acoustical Unit (Triphone) for Mandarin Continuous Speech Recognition

An Innovative Prosody Modeling Method for Chinese Speech Recognition

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

Context Dependent Initial/final Acoustic Modeling for Continuous Chinese Speech Recognition

Adapting pretrained speech model for Mandarin lyrics transcription and alignment

Towards Robustness to Speech Rate in Mandarin All-Syllable Recognition

Robust F0 Modeling for Mandarin Speech Recognition in Noise

Mandarin Pronunciation Modeling Based on CASS Corpus

Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision

Investigation of Deep Neural Network Acoustic Modelling Approaches for Low Resource Accented Mandarin Speech Recognition

Improved context-dependent acoustic modeling for continuous Chinese speech recognition

Research on Inter-Syllable Context-Dependent Acoustic Unit for Mandarin Continuous Speech Recognition

Research on Modeling Units of Transformer Transducer for Mandarin Speech Recognition

Speech Recognition System Based on SCHMM/ANN in Noisy Environment

Probabilistic Speaker-Class Based Acoustic Modeling for Large Vocabulary Continuous Speech Recognition