Abstract: This paper presents a novel multistage fine-tuning strategy designed to enhance automatic speech recognition (ASR) performance in low-resource languages using OpenAI's Whisper model. The approach builds an ASR model for a language with limited digital resources by sequentially adapting the model across linguistically similar languages. We evaluated it on Malasar, a Dravidian language spoken by approximately ten thousand people in the Western Ghats of South India. Malasar poses critical challenges for technological intervention because it has no native script and no existing digital or spoken data resources. Working in collaboration with Wycliffe India and Malasar community members, we created a spoken Malasar corpus paired with transcriptions in the script of Tamil, a closely related major language. To build the Malasar ASR model, we first build an intermediate Tamil ASR model, leveraging the greater availability of annotated Tamil speech. This intermediate model is then fine-tuned on Malasar data, allowing more effective adaptation despite the limited resources. The multistage fine-tuning strategy improved on direct fine-tuning on Malasar data alone, achieving a word error rate (WER) of 51.9%, a 4.5% absolute reduction compared with the direct fine-tuning method. Removing punctuation in post-processing, which addresses formatting inconsistencies that affect evaluation, reduced the WER further to 47.3%. Our results underscore the effectiveness of sequential multistage fine-tuning combined with targeted post-processing as a scalable strategy for ASR system development in low-resource languages, especially where linguistic similarities can be leveraged to bridge gaps in training data.
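To illustrate why punctuation removal can lower a reported WER, the following is a minimal sketch of word-level WER with an optional punctuation-stripping step. It is not the authors' evaluation code: the abstract does not say which scoring tool or which punctuation set was used, so the functions `strip_punctuation` and `word_error_rate`, the choice to drop every character in a Unicode punctuation category, and the English toy sentences are all assumptions made for illustration.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop every character whose Unicode category is punctuation (P*).

    Assumption: the paper's exact punctuation set is not stated.
    Combining vowel signs (categories Mn/Mc) are left untouched.
    """
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented toy pair (not from the Malasar corpus): the words match,
# but the reference carries punctuation that the hypothesis lacks.
reference = "hello, how are you?"
hypothesis = "hello how are you"

raw_wer = word_error_rate(reference, hypothesis)
clean_wer = word_error_rate(
    strip_punctuation(reference), strip_punctuation(hypothesis)
)
print(f"WER with punctuation:    {raw_wer:.2f}")
print(f"WER without punctuation: {clean_wer:.2f}")
```

In this toy pair, tokens such as "hello," and "hello" count as substitutions until the punctuation is stripped, which is the kind of formatting mismatch the post-processing step is described as addressing.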

Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR

Transfer learning of language-independent end-to-end ASR with language model fusion

Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR

Language-agnostic Multilingual Modeling

Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation

Multilingual Meta-Transfer Learning for Low-Resource Speech Recognition

Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification

Language-Universal Phonetic Representation in Multilingual Speech Pretraining for Low-Resource Speech Recognition

Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages

A Parameter-efficient Language Extension Framework for Multilingual ASR

From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition

Improvement of Acoustic Models Fused with Lip Visual Information for Low-Resource Speech

Meta-adapter: efficient cross-lingual adaptation with meta-learning

Model Adaptation for ASR in low-resource Indian Languages

Parameter-Efficient Learning for Text-to-Speech Accent Adaptation

Adapting the adapters for code-switching in multilingual ASR

Consistency Based Unsupervised Self-training For ASR Personalisation

Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation

Parameter-Efficient Cross-lingual Transfer of Vision and Language Models via Translation-based Alignment

Multiway-Adapter: Adapting Multimodal Large Language Models for Scalable Image-Text Retrieval

Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study