Using Punkt for Sentence Segmentation in non-Latin Scripts: Experiments on Kurdish (Sorani) Texts

Roshna Omer Abdulrahman, Hossein Hassani
DOI: https://doi.org/10.48550/arXiv.2004.14134
2020-04-30
Abstract: Segmentation is a fundamental step for most Natural Language Processing tasks. Kurdish is a multi-dialect, under-resourced language that is written in several scripts. The lack of segmented corpora is one of the major bottlenecks in Kurdish language processing. We used Punkt, an unsupervised machine learning method, to segment a Kurdish corpus in the Sorani dialect, written in Persian-Arabic script. According to the literature, studies applying Punkt to non-Latin data are scarce. In our experiment, we achieved an F1 score of 91.10% with an Error Rate of 16.32%. The high Error Rate is mainly due to abbreviations in Kurdish and partly due to ordinal numerals. The data, KTC-Segmented, is publicly available at https://github.com/KurdishBLARK/ for non-commercial use under the CC BY-NC-SA 4.0 licence.
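The abstract reports an F1 score and an Error Rate but does not define either. As a rough illustration only, the sketch below scores predicted sentence boundaries against gold boundaries. It assumes boundaries are compared as character offsets and that the error rate is the share of misclassified candidate positions; the paper may define both differently. The function name `boundary_scores` and the toy offsets are hypothetical and are not taken from the paper or its data.

```python
def boundary_scores(gold, predicted, candidates):
    """Score predicted sentence boundaries against gold boundaries.

    All three arguments are collections of character offsets.
    `candidates` holds every position where a boundary could occur
    (for example, every sentence-final punctuation mark).
    Assumption: error rate = (false positives + false negatives) / candidates.
    """
    gold, predicted, candidates = set(gold), set(predicted), set(candidates)

    true_pos = len(gold & predicted)    # boundaries found correctly
    false_pos = len(predicted - gold)   # boundaries predicted but not in gold
    false_neg = len(gold - predicted)   # gold boundaries that were missed

    found = true_pos + false_pos
    expected = true_pos + false_neg
    precision = true_pos / found if found else 0.0
    recall = true_pos / expected if expected else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    error_rate = (false_pos + false_neg) / len(candidates) if candidates else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "error_rate": error_rate,
    }


# Toy offsets, invented for illustration only.
scores = boundary_scores(
    gold={10, 25, 40, 58},
    predicted={10, 25, 33, 58},
    candidates={10, 18, 25, 33, 40, 58},
)
print(scores)
```

Under these assumed definitions the two metrics have different denominators, so a high F1 score and a noticeable error rate can occur together.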
Computation and Language