Arabic text classification using deep learning models

Ashraf Elnagar,Ridhwan Al-Debsi,Omar Einea
DOI: https://doi.org/10.1016/j.ipm.2019.102121
2020-01-01
Abstract:Text classification or categorization is the process of automatically tagging a textual document with most relevant labels or categories. When the number of labels is restricted to one, the task becomes single-label text categorization. However, the multi-label version is challenging. For Arabic language, both tasks (especially the latter one) become more challenging in the absence of large and free Arabic rich and rational datasets. Therefore, we introduce new rich and unbiased datasets for both the single-label (SANAD) as well as the multi-label (NADiA) Arabic text categorization tasks. Both corpora are made freely available to the research community on Arabic computational linguistics. Further, we present an extensive comparison of several deep learning (DL) models for Arabic text categorization in order to evaluate the effectiveness of such models on SANAD and NADiA. A unique characteristic of our proposed work, when compared to existing ones, is that it does not require a pre-processing phase and fully based on deep learning models. Besides, we studied the impact of utilizing word2vec embedding models to improve the performance of the classification tasks. Our experimental results showed solid performance of all models on SANAD corpus with a minimum accuracy of 91.18%, achieved by convolutional-GRU, and top performance of 96.94%, achieved by attention-GRU. As for NADiA, attention-GRU achieved the highest overall accuracy of 88.68% for a maximum subsets of 10 categories on "Masrawy" dataset.
computer science, information systems,information science & library science
What problem does this paper attempt to address?