Abstract: This paper presents a novel multistage fine-tuning strategy designed to enhance automatic speech recognition (ASR) performance in low-resource languages using OpenAI's Whisper model. The approach builds an ASR model for a language with limited digital resources by sequentially adapting the model across linguistically similar languages. We evaluated it on Malasar, a Dravidian language spoken by approximately ten thousand people in the Western Ghats of South India. Malasar poses critical challenges for technological intervention because it lacks a native script and has no existing digital or spoken data resources. Working in collaboration with Wycliffe India and Malasar community members, we created a spoken Malasar corpus paired with transcriptions in the script of Tamil, a closely related major language. To build the Malasar ASR model, we first build an intermediate Tamil ASR model, leveraging the greater availability of annotated Tamil speech. This intermediate model is subsequently fine-tuned on Malasar data, allowing for more effective ASR adaptation despite limited resources. The multistage fine-tuning strategy demonstrated significant improvements over direct fine-tuning on Malasar data alone, achieving a word error rate (WER) of 51.9%, a 4.5% absolute reduction compared to the direct fine-tuning method. A further WER reduction to 47.3% was achieved through punctuation removal in post-processing, which addresses formatting inconsistencies that impact evaluation. Our results underscore the effectiveness of sequential multistage fine-tuning combined with targeted post-processing as a scalable strategy for ASR system development in low-resource languages, especially where linguistic similarities can be leveraged to bridge gaps in training data.
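To illustrate why punctuation removal can lower the measured WER, the following is a minimal sketch, not the evaluation code used in this work. It assumes WER is computed as word-level edit distance divided by the number of reference words, and that punctuation is any character in a Unicode punctuation category. The helper names `strip_punctuation` and `word_error_rate` and the Tamil example sentence are hypothetical and are not taken from the Malasar corpus.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop every character whose Unicode category is punctuation (P*).

    Tamil letters, vowel signs, and the virama are categories Lo/Mc/Mn,
    so they are left untouched.
    """
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the hypothesis differs from the reference only
# in the punctuation attached to two words.
reference = "வணக்கம் நண்பா எப்படி இருக்கிறாய்"
hypothesis = "வணக்கம், நண்பா எப்படி இருக்கிறாய்?"

raw_wer = word_error_rate(reference, hypothesis)
clean_wer = word_error_rate(
    strip_punctuation(reference), strip_punctuation(hypothesis)
)
print(f"raw WER: {raw_wer:.2f}, WER after punctuation removal: {clean_wer:.2f}")
```

In this toy case two of the four words count as errors only because of attached punctuation, so normalizing both strings before scoring removes them. The same normalization must be applied to reference and hypothesis alike for the comparison to be fair.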

Transsion TSUP's speech recognition system for ASRU 2023 MADASR Challenge

Dialect Adaptation and Data Augmentation for Low-Resource ASR: TalTech Systems for the MADASR 2023 Challenge

The USTC-NERCSLIP Systems for The ICMC-ASR Challenge

The SpeakIn System Description for CNSRC2022

The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task

The NTNU Taiwanese ASR System for Formosa Speech Recognition Challenge 2020

Multi-Speaker Multi-Lingual VQTTS System for LIMMITS 2023 Challenge

MSV Challenge 2022: NPU-HC Speaker Verification System for Low-resource Indian Languages

The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge

Towards Building ASR Systems for the Next Billion Users

The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge

Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages

I4U System Description for NIST SRE'20 CTS Challenge

The NPU-ASLP System Description for Visual Speech Recognition in CNVSRC 2024

SURT 2.0: Advances in Transducer-based Multi-talker Speech Recognition

The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023

The USTC-NERCSLIP Systems for the CHiME-8 NOTSOFAR-1 Challenge

LeVoice ASR Systems for the ISCSLP 2022 Intelligent Cockpit Speech Recognition Challenge

Transformer-Transducers for Code-Switched Speech Recognition

The USYD-JD Speech Translation System for IWSLT 2021