Annotation and evaluation of a dialectal Arabic sentiment corpus against benchmark datasets using transformers
Ibtissam Touahri, Azzeddine Mazroui
DOI: https://doi.org/10.1007/s10579-024-09750-y
2024-08-21
Language Resources and Evaluation
Abstract: Sentiment analysis is a natural language processing task that aims to identify the overall polarity of reviews for subsequent analysis. This study used the Arabic speech-act and sentiment analysis, Arabic sentiment tweets dataset, and SemEval benchmark datasets, along with the Moroccan sentiment analysis corpus, which focuses on the Moroccan dialect. Furthermore, the modern standard and dialectal Arabic corpus dataset has been created and annotated according to three language types: Modern Standard Arabic, Moroccan Arabic dialect, and mixed language. The annotation has also been performed at the sentiment level, categorizing sentiments as positive, negative, or mixed. The datasets range in size from 2,000 to 21,000 reviews. The essential dialectal characteristics that can enhance a sentiment classification system have been outlined. The proposed approach deploys several supervised models, including occurrence vectors, Recurrent Neural Network-Long Short Term Memory, and the pre-trained transformer model Arabic bidirectional encoder representations from transformers (AraBERT), complemented by the integration of Generative Adversarial Networks (GANs). The uniqueness of the proposed approach lies in manually constructing and annotating a dialectal sentiment corpus and carefully studying its main characteristics, which are then used to feed the classical supervised model. Moreover, GANs, which widen the gap between the studied classes, have been used to enhance the results obtained with AraBERT. The classification test results have been promising, enabling a comparison with other systems. The proposed system has been evaluated against the state-of-the-art systems Mazajak and CAMeL Tools, which are designed for most Arabic dialects, using the mentioned datasets. A significant improvement of 30 points in F NN has been observed.
These results have affirmed the versatility of the proposed system, demonstrating its effectiveness across multi-dialectal, multi-domain datasets, as well as balanced and unbalanced ones.
computer science, interdisciplinary applications