End-to-End Transformer-based Automatic Speech Recognition for Northern Kurdish: A Pioneering Approach

Abdulhady Abas Abdullah, Shima Tabibian, Hadi Veisi, Aso Mahmudi, Tarik Rashid
2024-10-19
Abstract: Automatic Speech Recognition (ASR) for low-resource languages remains a challenging task due to limited training data. This paper introduces a comprehensive study exploring the effectiveness of Whisper, a pre-trained ASR model, for Northern Kurdish (Kurmanji), an under-resourced language spoken in the Middle East. We investigate three fine-tuning strategies: vanilla, specific parameters, and additional modules. Using a Northern Kurdish fine-tuning speech corpus containing approximately 68 hours of validated transcribed data, our experiments demonstrate that the additional module fine-tuning strategy significantly improves ASR accuracy on a specialized test set, achieving a Word Error Rate (WER) of 10.5% and Character Error Rate (CER) of 5.7% with Whisper version 3. These results underscore the potential of sophisticated transformer models for low-resource ASR and emphasize the importance of tailored fine-tuning techniques for optimal performance.
Audio and Speech Processing, Computation and Language
### What problems does this paper attempt to solve?

This paper addresses the challenges of automatic speech recognition (ASR) for low-resource languages such as Northern Kurdish (Kurmanji). Specifically, the researchers attempt to improve ASR performance for Northern Kurdish by fine-tuning the pre-trained Whisper model. The main problems the paper attempts to solve are:

1. **Data scarcity in low-resource languages**:
   - Low-resource languages such as Northern Kurdish lack sufficient training data, which makes it difficult for traditional ASR systems to achieve good performance.
   - The researchers used approximately 68 hours of validated transcribed data to explore how limited data can be used effectively for model fine-tuning.
2. **Effectiveness of fine-tuning strategies**:
   - The paper explores three fine-tuning strategies (vanilla, specific parameters, and additional modules) to determine which one most improves the accuracy of the ASR system.
   - The strategies are evaluated experimentally on different versions of the Whisper model to find the best fine-tuning method.
3. **Research on the internal mechanism of the model**:
   - Beyond the performance gains from fine-tuning, the researchers also analyze the internal mechanism of the Whisper model, especially the way it encodes speech.
   - This helps explain why some fine-tuning strategies are more effective than others and provides theoretical support for future improvements.
4. **Generality of ASR for low-resource languages**:
   - The results show that an appropriately fine-tuned Whisper model can achieve relatively high accuracy on a low-resource language, which is a useful reference for ASR development in other similar languages.

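The three strategies named above differ mainly in which parameters are updated during fine-tuning. The sketch below is a rough illustration of that idea only, not the authors' implementation. The parameter names and sizes in `TOY_PARAMS` are hypothetical and do not correspond to Whisper's real parameters, the choice of attention projections for the specific-parameter strategy is an assumption, and the adapter entries in `ADDED_MODULES` are a hypothetical stand-in for whatever additional modules the paper actually uses.

```python
# Toy parameter table: name -> number of weights.
# Hypothetical names and sizes, NOT Whisper's real parameter layout.
TOY_PARAMS = {
    "encoder.layer0.attn.q_proj": 1000,
    "encoder.layer0.attn.v_proj": 1000,
    "encoder.layer0.mlp.fc1": 4000,
    "decoder.layer0.attn.q_proj": 1000,
    "decoder.layer0.mlp.fc1": 4000,
}

# Hypothetical small modules added on top of the frozen base model.
ADDED_MODULES = {
    "encoder.layer0.adapter": 64,
    "decoder.layer0.adapter": 64,
}


def trainable_params(strategy):
    """Return the subset of parameters that would be updated."""
    if strategy == "vanilla":
        # Every pre-trained parameter is updated.
        return dict(TOY_PARAMS)
    if strategy == "specific":
        # Assumption: only attention projections are updated.
        return {n: s for n, s in TOY_PARAMS.items() if ".attn." in n}
    if strategy == "additional":
        # Base model stays frozen; only the newly added modules train.
        return dict(ADDED_MODULES)
    raise ValueError(f"unknown strategy: {strategy}")


def trainable_fraction(strategy):
    """Fraction of all weights in the model that are trainable."""
    total = sum(TOY_PARAMS.values())
    if strategy == "additional":
        total += sum(ADDED_MODULES.values())
    return sum(trainable_params(strategy).values()) / total


for name in ("vanilla", "specific", "additional"):
    print(name, sorted(trainable_params(name)), round(trainable_fraction(name), 4))
```

In a real training setup the same selection would typically be expressed by marking the chosen tensors as trainable and freezing the rest; the point of the toy table is only that the strategies update very different shares of the model.
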
### Experimental Results

Through a series of experiments, the researchers found that:

- **Vanilla Fine-Tuning**: All parameters are involved in fine-tuning. Although the overall performance is improved, there is still room for improvement.
- **Specific Parameter Fine-Tuning**: Only key parameters (such as the attention layer) are adjusted, which further improves the accuracy and reduces the overfitting risk.
- **Additional Module Fine-Tuning**: By introducing additional modules (such as a new tokenizer), the word error rate (WER) and character error rate (CER) are significantly reduced while maintaining the generalization ability of the model.

Finally, Whisper V3 combined with the additional module fine-tuning strategy achieved the best results, with a WER of 10.5% and a CER of 5.7%. This indicates that for low-resource languages such as Northern Kurdish, advanced Transformer models and optimized fine-tuning techniques can significantly improve the performance of ASR systems.

### Conclusion

This research shows that by fine-tuning the Whisper model, especially the latest version, Whisper V3, the ASR performance of low-resource languages (such as Northern Kurdish) can be significantly improved. This research not only provides a new benchmark for the ASR of Northern Kurdish but also provides valuable references for the ASR development of other low-resource languages. Future work can further explore more diverse datasets and language features to continue to improve the performance of ASR systems.
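
For readers unfamiliar with the two metrics reported above: WER and CER are conventionally defined as the edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognizer's output, divided by the length of the reference, counted in words for WER and in characters for CER. The sketch below is a minimal pure-Python version of that standard definition, not the paper's evaluation code. The example sentences are English placeholders rather than data from the paper, and counting spaces as characters in CER is an assumption, since conventions and text normalization vary between toolkits.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of tokens)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; spaces count as characters (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Placeholder strings for illustration only.
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER = {wer(reference, hypothesis):.3f}")  # 2 word errors / 6 words
print(f"CER = {cer(reference, hypothesis):.3f}")  # 5 char errors / 22 chars
```

Because the denominator is the reference length, both rates can exceed 100% when the output contains many insertions, and a CER lower than the WER, as in the reported 5.7% versus 10.5%, is expected because a single wrong character makes the whole word count as an error.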