Abstract: Automatic Speech Recognition (ASR) for low-resource languages remains a challenging task due to limited training data. This paper introduces a comprehensive study exploring the effectiveness of Whisper, a pre-trained ASR model, for Northern Kurdish (Kurmanji), an under-resourced language spoken in the Middle East. We investigate three fine-tuning strategies: vanilla, specific parameters, and additional modules. Using a Northern Kurdish fine-tuning speech corpus containing approximately 68 hours of validated transcribed data, our experiments demonstrate that the additional module fine-tuning strategy significantly improves ASR accuracy on a specialized test set, achieving a Word Error Rate (WER) of 10.5% and Character Error Rate (CER) of 5.7% with Whisper version 3. These results underscore the potential of sophisticated transformer models for low-resource ASR and emphasize the importance of tailored fine-tuning techniques for optimal performance.

What problem does this paper attempt to address?

This paper addresses the challenges of automatic speech recognition (ASR) for low-resource languages such as Kurmanji (Northern Kurdish). Specifically, the researchers attempt to improve ASR performance for Northern Kurdish by fine-tuning the pre-trained Whisper model. The main problems the paper attempts to solve are:

1. **Data scarcity in low-resource languages**:
   - Low-resource languages such as Northern Kurdish lack sufficient training data, which makes it difficult for traditional ASR systems to perform well.
   - The researchers used approximately 68 hours of validated transcribed data to explore how limited data can be used effectively for model fine-tuning.
2. **Effectiveness of fine-tuning strategies**:
   - The paper explores three fine-tuning strategies (vanilla, specific parameters, and additional modules) to determine which one most improves the accuracy of the ASR system.
   - The strategies are evaluated experimentally on different versions of the Whisper model to find the best fine-tuning method.
3. **Research on the internal mechanism of the model**:
   - Beyond the performance gains from fine-tuning, the researchers analyze the internal mechanism of the Whisper model, especially the way it encodes speech.
   - This helps explain why some fine-tuning strategies are more effective than others and provides theoretical support for future improvements.
4. **Generality of ASR for low-resource languages**:
   - The results show that an appropriately fine-tuned Whisper model can reach relatively high accuracy on a low-resource language, which serves as a useful reference for ASR development in other similar languages.

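The three strategies named in item 2 differ mainly in which parameters receive updates during training. The sketch below illustrates that distinction only; it is not the paper's implementation. The parameter names, the sizes, the choice of attention projections for the "specific parameters" strategy, and the `adapter.*` entries standing in for an "additional module" are all hypothetical assumptions made for illustration.

```python
# Hypothetical parameter inventory: name -> number of weights.
# Names and sizes are illustrative assumptions, not taken from Whisper.
BASE_PARAMS = {
    "encoder.layer0.attn.q_proj": 1_000,
    "encoder.layer0.attn.v_proj": 1_000,
    "encoder.layer0.mlp.fc1": 4_000,
    "decoder.layer0.attn.q_proj": 1_000,
    "decoder.layer0.mlp.fc1": 4_000,
}

# Hypothetical small module added on top of the frozen base model.
EXTRA_PARAMS = {
    "adapter.down": 80,
    "adapter.up": 80,
}


def trainable_names(strategy):
    """Return the set of parameter names that would be updated."""
    if strategy == "vanilla":
        # Every pre-trained parameter is updated.
        return set(BASE_PARAMS)
    if strategy == "specific":
        # Assumption: only attention parameters are updated.
        return {name for name in BASE_PARAMS if ".attn." in name}
    if strategy == "additional":
        # The base model stays frozen; only the added module is trained.
        return set(EXTRA_PARAMS)
    raise ValueError(f"unknown strategy: {strategy!r}")


def trainable_fraction(strategy):
    """Share of the model's weights that the strategy updates."""
    # The added module only exists in the model under the "additional" strategy.
    model = dict(BASE_PARAMS)
    if strategy == "additional":
        model.update(EXTRA_PARAMS)
    trained = sum(model[name] for name in trainable_names(strategy))
    return trained / sum(model.values())


if __name__ == "__main__":
    for strategy in ("vanilla", "specific", "additional"):
        print(strategy, round(trainable_fraction(strategy), 4))
```

With these made-up sizes, the vanilla strategy updates every weight, the specific-parameter strategy updates a minority of them, and the additional-module strategy updates only a small fraction. Fewer updated weights is one commonly cited reason such strategies may be less prone to over-fitting on small corpora.
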
### Experimental Results

Through a series of experiments, the researchers found that:

- **Vanilla Fine-Tuning**: All parameters are involved in fine-tuning. Overall performance improves, but room for further gains remains.
- **Specific Parameter Fine-Tuning**: Only key parameters (such as the attention layer) are adjusted, which further improves accuracy and reduces the risk of over-fitting.
- **Additional Module Fine-Tuning**: By introducing additional modules (such as a new tokenizer), the word error rate (WER) and character error rate (CER) are significantly reduced while the generalization ability of the model is maintained.

Finally, Whisper V3 combined with the additional module fine-tuning strategy achieved the best results, with a WER of 10.5% and a CER of 5.7%. This indicates that for low-resource languages such as Northern Kurdish, advanced Transformer models and optimized fine-tuning techniques can significantly improve the performance of ASR systems.

### Conclusion

This research shows that fine-tuning the Whisper model, especially the latest version, Whisper V3, can significantly improve ASR performance for low-resource languages such as Northern Kurdish. It provides a new benchmark for Northern Kurdish ASR and a valuable reference for ASR development in other low-resource languages. Future work can explore more diverse datasets and language features to further improve the performance of ASR systems.
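
The WER and CER figures quoted above are both edit-distance ratios: the minimum number of substitutions, deletions, and insertions needed to turn the system output into the reference, divided by the reference length in words (WER) or characters (CER). A minimal sketch follows, assuming no text normalization and counting spaces as characters for CER; the paper's actual scoring setup is not described here, and the example sentences are invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "the cat sat"   # invented example
    hypothesis = "the cat sit"  # one wrong word, one wrong character
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

Because a single wrong character makes the whole word count as an error, WER is typically higher than CER on the same output, which is consistent with the 10.5% and 5.7% figures reported.
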

End-to-End Transformer-based Automatic Speech Recognition for Northern Kurdish: A Pioneering Approach

Breaking Walls: Pioneering Automatic Speech Recognition for Central Kurdish: End-to-End Transformer Paradigm

Implementation of a Whisper Architecture-Based Turkish Automatic Speech Recognition (ASR) System and Evaluation of the Effect of Fine-Tuning with a Low-Rank Adaptation (LoRA) Adapter on Its Performance

Central Kurdish Text-to-Speech Synthesis with Novel End-to-End Transformer Training

Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages

Exploration of Whisper fine-tuning strategies for low-resource ASR

Enhancing Kurdish Text-to-Speech with Native Corpus Training: A High-Quality WaveGlow Vocoder Approach

End-to-End Whisper to Natural Speech Conversion using Modified Transformer Network

Multilingual end-to-end ASR for low-resource Turkic languages with common alphabets

Exploring Native and Non-Native English Child Speech Recognition With Whisper

Meta-Whisper: Speech-Based Meta-ICL for ASR on Low-Resource Languages

Whisper Finetuning on Nepali Language

Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts

Dynamic Acoustic Unit Augmentation with BPE-Dropout for Low-Resource End-to-End Speech Recognition

Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR

Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text

Leveraging Self-Supervised Models for Automatic Whispered Speech Recognition

Efficient Compression of Multitask Multilingual Speech Models

End-to-end automated speech recognition using a character based small scale transformer architecture

N-Shot Benchmarking of Whisper on Diverse Arabic Speech Recognition

M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper