Abstract: Kurdish libraries hold many historical publications that were printed in the early days after printing devices were first brought to Kurdistan. A good Optical Character Recognition (OCR) system would help process these publications and contribute to Kurdish language resources, which is crucial because Kurdish is considered a low-resource language. Current OCR systems are unable to extract text from historical documents because the documents pose many problems: they are damaged, very fragile, carry many marks, and are often printed in non-standard fonts, among other issues. This is a major obstacle, as processing these documents currently requires manual typing, which is very time-consuming. In this study, we adopt Tesseract version 5.0, an open-source OCR framework by Google that has been used to extract text in various languages. Because no public dataset currently exists, we developed our own by collecting historical documents printed before 1950 from the Zheen Center for Documentation and Research, resulting in a dataset of 1233 line images, each with its transcription. We then used the Arabic model as our base model and trained it on this dataset. We used different methods to evaluate our model: Tesseract's built-in evaluator lstmeval indicated a Character Error Rate (CER) of 0.755%, and Ocreval demonstrated an average character accuracy of 84.02%. Finally, we developed a web application that provides an easy-to-use interface for end users, allowing them to interact with the model by inputting an image of a page and extracting its text. An extensive dataset is crucial for developing OCR systems with reasonable accuracy, and the lack of public datasets for historical Kurdish documents posed a significant challenge in our work. Additionally, the unaligned spaces between characters and words proved to be another challenge.
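To make the evaluation metric concrete, the following minimal Python sketch computes a character error rate using the common definition: Levenshtein edit distance between the reference transcription and the OCR output, divided by the reference length. This is an illustrative assumption about the metric, not a reproduction of how lstmeval or Ocreval compute their figures; the function names `edit_distance` and `character_error_rate` are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from ref
                curr[j - 1] + 1,     # insert a character into ref
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER under the assumed definition: edit distance / reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical line pair: one substituted character in a 4-character reference.
    print(character_error_rate("abcd", "abxd"))
```

Under this definition, character accuracy is simply one minus the CER. The two figures reported in the abstract come from different tools, so this sketch is not meant to reconcile or reproduce either of them.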

Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction

Nougat: Neural Optical Understanding for Academic Documents

An End-to-End OCR Framework for Robust Arabic-Handwriting Recognition using a Novel Transformers-based Model and an Innovative 270 Million-Words Multi-Font Corpus of Classical Arabic with Diacritics

Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models

Supporting Undotted Arabic with Pre-trained Language Models

Object Recognition System for the Visually Impaired: A Deep Learning Approach using Arabic Annotation

ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic

Strategies for Arabic Readability Modeling

An Efficient Language-Independent Multi-Font OCR for Arabic Script

A Comparative Study of Deep Learning Approaches for Arabic Language Processing

A Benchmark Evaluation of Multilingual Large Language Models for Arabic Cross-Lingual Named-Entity Recognition

AraT5: Text-to-Text Transformers for Arabic Language Generation

Deep Neural Models and Retrofitting for Arabic Text Categorization

Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization

AraNet: A Deep Learning Toolkit for Arabic Social Media

Data Augmentation using Transformers and Similarity Measures for Improving Arabic Text Classification

ADOCRNet: A Deep Learning OCR for Arabic Documents Recognition

ALLaM: Large Language Models for Arabic and English

Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines

ArabicaQA: A Comprehensive Dataset for Arabic Question Answering

mucAI at WojoodNER 2024: Arabic Named Entity Recognition with Nearest Neighbor Search