A Single-Model Approach for Arabic Segmentation, POS Tagging, and Named Entity Recognition

Abed Alhakim Freihat, Gabor Bella, Hamdy Mubarak, Fausto Giunchiglia
DOI: https://doi.org/10.1109/icnlsp.2018.8374393
2018-01-01
Abstract: This paper presents an entirely new, one-million-word annotated corpus for comprehensive, machine-learning-based preprocessing of text in Modern Standard Arabic. Contrary to the conventional pipeline architecture, we solve the NLP tasks of word segmentation, POS tagging, and named entity recognition as a single sequence labeling task. This single-component configuration results in faster operation and provides state-of-the-art precision and recall according to our evaluations. The fine-grained tag set produced by our annotator greatly simplifies downstream tasks such as lemmatization. Provided as a trained OpenNLP component, the annotator is free for research purposes.
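
To illustrate how one label per word can carry the output of all three tasks, here is a minimal sketch in Python. The label format used here (segment-level POS tags joined by "+", followed by "/" and a BIO entity tag) is an assumption made for this example and is not the tag set defined in the paper; the names DecodedLabel, decode_label, and decode_sequence are likewise hypothetical.

```python
from typing import List, NamedTuple, Sequence, Tuple


class DecodedLabel(NamedTuple):
    pos_tags: List[str]  # one POS tag per segment of the word
    entity: str          # BIO entity tag, "O" when the word is outside any entity


def decode_label(label: str) -> DecodedLabel:
    """Split a hypothetical composite label such as 'CONJ+PREP+NOUN_PROP/B-LOC'.

    The part before '/' lists the POS tag of each segment (which also implies
    how many segments the word has); the part after '/' is the entity tag.
    """
    pos_part, _, entity = label.partition("/")
    return DecodedLabel(pos_tags=pos_part.split("+"), entity=entity or "O")


def decode_sequence(
    tokens: Sequence[str], labels: Sequence[str]
) -> List[Tuple[str, DecodedLabel]]:
    """Pair each token with its decoded label; one label per token."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [(tok, decode_label(lab)) for tok, lab in zip(tokens, labels)]


if __name__ == "__main__":
    # Placeholder tokens; a real tagger would supply the labels.
    tokens = ["tok1", "tok2"]
    labels = ["VERB", "CONJ+PREP+NOUN_PROP/B-LOC"]
    for tok, decoded in decode_sequence(tokens, labels):
        print(tok, decoded.pos_tags, decoded.entity)
```

Under this assumed encoding, a single pass of a sequence labeler yields the segment count, the POS tags, and the entity tag of each word, so no separate segmenter, tagger, or entity recognizer needs to run afterwards.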