Abstract: Kurdish libraries hold many historical publications that were printed in the early days of printing in Kurdistan. A good Optical Character Recognition (OCR) system would help process these publications and contribute to Kurdish language resources, which is crucial because Kurdish is considered a low-resource language. Current OCR systems are unable to extract text from historical documents because the documents pose many problems: they are damaged, very fragile, covered in marks, often printed in non-standard fonts, and more. This is a major obstacle, as processing these documents currently requires manual typing, which is very time-consuming. In this study, we adopt Tesseract version 5.0, an open-source OCR framework by Google that has been used to extract text for various languages. Because no public dataset currently exists, we developed our own by collecting historical documents printed before 1950 from the Zheen Center for Documentation and Research, which resulted in a dataset of 1233 line images, each with its transcription. We then used the Arabic model as our base model and trained it on this dataset. We evaluated the model with different methods: Tesseract's built-in evaluator, lstmeval, indicated a Character Error Rate (CER) of 0.755%, and Ocreval demonstrated an average character accuracy of 84.02%. Finally, we developed a web application that provides an easy-to-use interface for end users, allowing them to interact with the model by inputting an image of a page and extracting its text. An extensive dataset is crucial for developing OCR systems with reasonable accuracy; since no public datasets are currently available for historical Kurdish documents, this posed a significant challenge in our work. The unaligned spaces between characters and words proved to be another challenge.
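To make the reported metric concrete, the following is a minimal sketch of how a Character Error Rate is commonly defined: the edit distance between the reference transcription and the OCR output, divided by the reference length. The function names `levenshtein` and `character_error_rate` are illustrative assumptions, not part of Tesseract or Ocreval, and the exact counting rules used by lstmeval and Ocreval may differ from this sketch.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `ref` into `hyp`."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER as a percentage of the reference length (hypothetical helper).
    Lengths are counted in Unicode code points, which is an assumption."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return 100.0 * levenshtein(ref, hyp) / len(ref)
```

Under this definition, character accuracy is often reported as 100 minus the CER, although a tool may count errors differently (for example, in how it treats whitespace), so figures from different evaluators are not necessarily comparable.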
