Abstract: There exist three approaches for multilingual and crosslingual automatic speech recognition (MCL-ASR): supervised pre-training with phonetic or graphemic transcription, and self-supervised pre-training. We find that pre-training with phonetic supervision has been underappreciated so far for MCL-ASR, while conceptually it is more advantageous for information sharing between different languages. This paper explores pre-training with weakly phonetic supervision towards data-efficient MCL-ASR, an approach we call Whistle. We relax the requirement of gold-standard human-validated phonetic transcripts, and obtain International Phonetic Alphabet (IPA) based transcription by leveraging the LanguageNet grapheme-to-phoneme (G2P) models. We construct a common experimental setup based on the CommonVoice dataset, called CV-Lang10, with 10 seen languages and 2 unseen languages. A set of experiments is conducted on CV-Lang10 to compare, as fairly as possible, the three approaches under the common setup for MCL-ASR. Experiments demonstrate the advantages of phoneme-based models (Whistle) for MCL-ASR, in terms of speech recognition for seen languages, crosslingual performance for unseen languages with different amounts of few-shot data, overcoming catastrophic forgetting, and training efficiency. It is found that when training data is more limited, phoneme supervision can achieve better results than subword supervision and self-supervision, thereby providing higher data efficiency. To support reproducibility and promote future research along this direction, we will release the code, models and data for the whole pipeline of Whistle at https://github.com/thu-spmi/CAT upon publication.

M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper

Meta-Whisper: Speech-Based Meta-ICL for ASR on Low-Resource Languages

A Multitask Training Approach to Enhance Whisper with Contextual Biasing and Open-Vocabulary Keyword Spotting

Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages

A Multitask Training Approach to Enhance Whisper with Open-Vocabulary Keyword Spotting

LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR

SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR

MC-Whisper: Extending Speech Foundation Models to Multichannel Distant Speech Recognition

Towards Rehearsal-Free Multilingual ASR: A LoRA-based Case Study on Whisper

Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts

Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition via Weakly Phonetic Supervision

Exploring Native and Non-Native English Child Speech Recognition With Whisper

Improving Whispered Speech Recognition Performance using Pseudo-whispered based Data Augmentation

PI-Whisper: An Adaptive and Incremental ASR Framework for Diverse and Evolving Speaker Characteristics

Enhancing Whisper Model for Pronunciation Assessment with Multi-Adapters

Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models

Wav2vec‐MoE: an Unsupervised Pre‐training and Adaptation Method for Multi‐accent ASR

Adapting Whisper for Code-Switching through Encoding Refining and Language-Aware Decoding

A Study on Incorporating Whisper for Robust Speech Assessment

Extending Whisper with prompt tuning to target-speaker ASR

Efficient Compression of Multitask Multilingual Speech Models