Abstract: This paper presents a novel multistage fine-tuning strategy designed to enhance automatic speech recognition (ASR) performance in low-resource languages using OpenAI's Whisper model. The approach builds an ASR model for a language with limited digital resources by sequentially adapting the model across linguistically similar languages. We evaluated it on Malasar, a Dravidian language spoken by approximately ten thousand people in the Western Ghats of South India. Malasar poses critical challenges for technological intervention because it has no native script and no existing digital or spoken data resources. Working in collaboration with Wycliffe India and Malasar community members, we created a spoken Malasar corpus paired with transcriptions in the script of Tamil, a closely related major language. To build the Malasar ASR model, we first build an intermediate Tamil ASR model, leveraging the greater availability of annotated Tamil speech. This intermediate model is then fine-tuned on Malasar data, allowing for more effective ASR adaptation despite limited resources. The multistage fine-tuning strategy improved on direct fine-tuning on Malasar data alone, achieving a word error rate (WER) of 51.9%, a 4.5% absolute reduction compared to the direct fine-tuning method. Removing punctuation in post-processing, which addresses formatting inconsistencies that affect evaluation, reduced the WER further to 47.3%. Our results underscore the effectiveness of sequential multistage fine-tuning combined with targeted post-processing as a scalable strategy for ASR system development in low-resource languages, especially where linguistic similarities can be leveraged to bridge gaps in training data.
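
The effect of punctuation removal on WER can be illustrated with a minimal sketch. The abstract does not describe the exact normalization or scoring tool used, so everything here is an assumption for illustration: the helper names `strip_punctuation` and `word_error_rate` are hypothetical, punctuation is defined as any character in a Unicode `P*` category, and the toy sentences are invented rather than taken from the Malasar corpus.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop every character whose Unicode category is punctuation (P*).

    Assumption: this stands in for the paper's post-processing step,
    whose exact rules are not given in the abstract.
    """
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented toy pair: one genuine word error plus punctuation mismatches.
reference = "hello world how are you"
hypothesis = "hello, world. how is you?"

raw_wer = word_error_rate(reference, hypothesis)
clean_wer = word_error_rate(
    strip_punctuation(reference), strip_punctuation(hypothesis)
)
print(f"WER with punctuation:    {raw_wer:.2f}")
print(f"WER without punctuation: {clean_wer:.2f}")
```

In this toy pair, punctuation attached to otherwise correct words is counted as a substitution, so stripping it leaves only the genuine word error. Because the filter removes only `P*` characters, Tamil vowel signs and the virama (Unicode mark categories) are left intact.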

XLS-R Deep Learning Model for Multilingual ASR on Low-Resource Languages: Indonesian, Javanese, and Sundanese

Indonesian Automatic Speech Recognition with XLSR-53

Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities

Towards Building ASR Systems for the Next Billion Users

XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

Anatomy of Industrial Scale Multilingual ASR

Cross-Lingual Machine Speech Chain for Javanese, Sundanese, Balinese, and Bataks Speech Recognition and Synthesis

A Survey of Multilingual Models for Automatic Speech Recognition

Language-agnostic Multilingual Modeling

Low Resource Malay Dialect Automatic Speech Recognition Modeling Using Transfer Learning from a Standard Malay Model

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking

Crossing language identification: Multilingual ASR framework based on semantic dataset creation & Wav2Vec 2.0

Low-Resource Language Modelling of South African Languages

A General Procedure for Improving Language Models in Low-Resource Speech Recognition

A Deep Learning System for Domain-specific Speech Recognition

Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward

Improvement of Acoustic Models Fused with Lip Visual Information for Low-Resource Speech

Snow Mountain: Dataset of Audio Recordings of The Bible in Low Resource Languages

Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages

Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation