Abstract: Large language models (LLMs) have excelled in numerous benchmarks, advancing AI applications in both linguistic and non-linguistic tasks. However, this has primarily benefited well-resourced languages, leaving less-resourced ones (LRLs) at a disadvantage. In this paper, we highlight the current state of the NLP field for one specific LRL: the Kyrgyz language (kyrgyz tili). Human evaluation, including annotated datasets created by native speakers, remains an irreplaceable component of reliable NLP performance, especially for LRLs where automatic evaluations can fall short. Recent assessments of the resources for Turkic languages label Kyrgyz with the status 'Scraping By', marking it as a severely under-resourced language even though it is spoken by millions. This is concerning given the growing importance of the language, not only in Kyrgyzstan but also among diaspora communities where it holds no official status. We review prior efforts in the field, noting that, with few exceptions beyond dictionaries, many of the publicly available resources have only recently been developed (the processed data used for the analysis is available at https://kyrgyznlp.github.io/). While recent papers have made some headway, much more remains to be done. Despite interest and support from both business and government sectors in the Kyrgyz Republic, the situation for Kyrgyz language resources remains challenging. We stress the importance of community-driven efforts to build these resources and to ensure that future advancement is sustainable. We then share our view of the most pressing challenges in Kyrgyz NLP. Finally, we propose a roadmap for future development in terms of research topics and language resources.

What problem does this paper attempt to address?

The problem this paper attempts to address is the scarcity of resources for natural language processing (NLP) in Kyrgyz (кыргыз тили). Specifically, the paper focuses on the following aspects:

1. **Resource Scarcity**: Although large language models (LLMs) perform well on many benchmarks and have advanced the capabilities of artificial intelligence in language understanding and generation, these advances mainly benefit well-resourced languages, while less-resourced languages (LRLs) such as Kyrgyz are at a disadvantage. The paper points out that Kyrgyz is classified as 'Scraping By', indicating a serious shortage of digital tools, datasets, and models.
2. **Community-Driven Efforts**: The paper emphasizes the importance of community-driven efforts in building NLP resources for less-resourced languages to ensure the sustainability of future development.
3. **Review of Existing Resources**: The paper reviews existing efforts in Kyrgyz NLP, pointing out that many publicly available resources have been developed only recently and that, apart from dictionaries, other resources are relatively scarce.
4. **Challenges Faced**: The paper identifies the most urgent challenges in Kyrgyz NLP, including resource scarcity, writing systems and dialect diversity, complex agglutinative morphology, and the fragmentation of existing efforts.
5. **Future Development Roadmap**: The paper proposes a roadmap for future development, outlining key research areas and necessary language resources, with the aim of encouraging support from the government and the private sector and ensuring that advanced language technologies can benefit all language communities.

Through these analyses, the paper aims to provide a comprehensive perspective on the development of Kyrgyz NLP and to offer specific recommendations for promoting progress in this field.

KyrgyzNLP: Challenges, Progress, and Future

Recent Advancements and Challenges of Turkic Central Asian Language Processing

Multilingual end-to-end ASR for low-resource Turkic languages with common alphabets

Automatic Speech Recognition for Uyghur, Kazakh, and Kyrgyz: an Overview

From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation

A Multilingual Language Processing Tool for Uyghur, Kazak and Kirghiz

No Language Left Behind: Scaling Human-Centered Machine Translation

Research on The Current Situation of Chinese Language Teaching in Kyrgyzstan

KYRGYZ-RUSSIAN SLAVIC UNIVERSITY IN THE NEW CULTURAL-LANGUAGE REALITIES

A free Kazakh speech database and a speech recognition baseline

LLMs for Extremely Low-Resource Finno-Ugric Languages

Building Better: Avoiding Pitfalls in Developing Language Resources when Data is Scarce

Using RDF Models to Create Knowledge Bases in the Kazakh Language: Comparison with Other Methods

Building Low-Resource NER Models Using Non-Speaker Annotation

Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration

Building an Efficient Multilingual Non-Profit IR System for the Islamic Domain Leveraging Multiprocessing Design in Rust

NLP Progress in Indigenous Latin American Languages

BenLLMEval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP

Vikhr: Constructing a State-of-the-art Bilingual Open-Source Instruction-Following Large Language Model for Russian

LLMs in the Loop: Leveraging Large Language Model Annotations for Active Learning in Low-Resource Languages