Abstract: There exist three approaches to multilingual and crosslingual automatic speech recognition (MCL-ASR): supervised pre-training with phonetic or graphemic transcription, and self-supervised pre-training. We find that pre-training with phonetic supervision has been underappreciated so far for MCL-ASR, while conceptually it is more advantageous for information sharing between different languages. This paper explores the approach of pre-training with weakly phonetic supervision towards data-efficient MCL-ASR, which is called Whistle. We relax the requirement of gold-standard human-validated phonetic transcripts, and obtain International Phonetic Alphabet (IPA) based transcription by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. We construct a common experimental setup based on the CommonVoice dataset, called CV-Lang10, with 10 seen languages and 2 unseen languages. A set of experiments is conducted on CV-Lang10 to compare, as fairly as possible, the three approaches under the common setup for MCL-ASR. Experiments demonstrate the advantages of phoneme-based models (Whistle) for MCL-ASR, in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency. It is found that when training data is more limited, phoneme supervision can achieve better results compared to subword supervision and self-supervision, thereby providing higher data-efficiency. To support reproducibility and promote future research along this direction, we will release the code, models and data for the whole pipeline of Whistle at https://github.com/thu-spmi/CAT upon publication.

Fast and Efficient Multilingual Self-Supervised Pre-training for Low-Resource Speech Recognition

Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition

Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

UniSpeech at scale: An Empirical Study of Pre-training Method on Large-Scale Speech Recognition Dataset

Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning

Improving Automatic Speech Recognition Performance for Low-Resource Languages With Self-Supervised Models

How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario

Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision

Improved Self-Supervised Multilingual Speech Representation Learning Combined with Auxiliary Language Information

Efficient Training of Self-Supervised Speech Foundation Models on a Compute Budget

Speech Representation Learning Through Self-supervised Pretraining And Multi-task Finetuning

Progressive Residual Extraction based Pre-training for Speech Representation Learning

Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation

Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning

Exploring Self-Supervised Multi-view Contrastive Learning for Speech Emotion Recognition with Limited Annotations

Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data

Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute

OTF: Optimal Transport based Fusion of Supervised and Self-Supervised Learning Models for Automatic Speech Recognition

VoiceTuner: Self-Supervised Pre-training and Efficient Fine-tuning for Voice Generation

Linear-Complexity Self-Supervised Learning for Speech Processing