Abstract: Large language models (LLMs) have excelled in numerous benchmarks, advancing AI applications in both linguistic and non-linguistic tasks. However, this has primarily benefited well-resourced languages, leaving less-resourced languages (LRLs) at a disadvantage. In this paper, we highlight the current state of the NLP field for one specific LRL: Kyrgyz (kyrgyz tili). Human evaluation, including annotated datasets created by native speakers, remains an irreplaceable component of reliable NLP performance, especially for LRLs where automatic evaluations can fall short. In recent assessments of the resources for Turkic languages, Kyrgyz, a severely under-resourced language spoken by millions, is labeled with the status 'Scraping By'. This is concerning given the growing importance of the language, not only in Kyrgyzstan but also among diaspora communities where it holds no official status. We review prior efforts in the field, noting that most of the publicly available resources, with few exceptions beyond dictionaries, have only recently been developed (the processed data used for the analysis is presented at https://kyrgyznlp.github.io/). While recent papers have made some headway, much more remains to be done. Despite interest and support from both the business and government sectors in the Kyrgyz Republic, the situation for Kyrgyz language resources remains challenging. We stress the importance of community-driven efforts to build these resources and to ensure that future advancement is sustainable. We then share our view of the most pressing challenges in Kyrgyz NLP. Finally, we propose a roadmap for future development in terms of research topics and language resources.

A Multilingual Language Processing Tool for Uyghur, Kazak and Kirghiz

Development and Evaluation of Task-Specific NLP Framework in China.

MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China

Recent Advancements and Challenges of Turkic Central Asian Language Processing

Automatic Speech Recognition for Uyghur, Kazakh, and Kyrgyz: an Overview

Text Filtering through Multi-Pattern Matching: A Case Study of Wu–Manber–Uy on the Language of Uyghur

Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration

Improving Uyghur ASR systems with decoders using morpheme-based language models

Multilingual end-to-end ASR for low-resource Turkic languages with common alphabets

KyrgyzNLP: Challenges, Progress, and Future

Design and Implementation of a Tool for Extracting Uzbek Syllables

Uyghur, Chinese and English Multilingual Document Recognition

Error Analysis of Uyghur Name Tagging: Language-specific Techniques and Remaining Challenges.

A free Kazakh speech database and a speech recognition baseline.

Unification of Balti and trans-border sister dialects in the essence of LLMs and AI Technology

Uyghur Morphological Segmentation with Bidirectional GRU Neural Networks

Machine learning based framework for fine-grained word segmentation and enhanced text normalization for low resourced language

Design and implementation of prototype system for online handwritten Uyghur character recognition

Robust and Parallel Uyghur Text Localization in Complex Background Images