Exploring Character Trigrams for Robust Arabic Text Classification: A Comparative Analysis in the Face of Vocabulary Expansion and Misspelled Words
Dorieh Alomari, Irfan Ahmad
DOI: https://doi.org/10.1109/access.2024.3390048
IF: 3.9
2024-04-27
IEEE Access
Abstract: Tokenization is an important early step in natural language processing (NLP) tasks. The idea is to split the input sentence into smaller units, called tokens, for further processing. Words are the most commonly used tokens in text classification tasks, but other tokenization schemes, such as subword and character tokens, are also popular. The increasing availability of training corpora has posed challenges for the word tokenization technique, primarily due to vocabulary size expansion. This has underscored the importance of exploring alternative tokenization schemes, especially for morphologically rich languages like Arabic. In this study, we assess the efficacy of character trigrams for Arabic sentiment analysis and text classification, particularly their robustness against misspelled words. We compare character trigrams with word and WordPiece tokenization across five datasets, encompassing tweet sentiment analysis, review sentiment analysis, and news classification. These datasets, which range from small to large in vocabulary size, facilitate a comparative examination of vocabulary and Out-Of-Vocabulary (OOV) sizes for word and character trigram embeddings. The word and character trigram embeddings are integrated into a deep learning (DL) model featuring a dense layer connected to a Bidirectional Gated Recurrent Unit (BiGRU) for classification purposes. Meanwhile, WordPiece embeddings serve to fine-tune the AraBert model. Our findings reveal that the word embedding approach, when applied to extensive corpora, leads to a significant increase in vocabulary and OOV rates, whereas character trigram embedding maintains manageable sizes for both. On test sets devoid of spelling mistakes, the AraBert model surpasses the other models in performance, while the word and character trigram embedding models perform at similar levels to each other.
In contrast, on datasets with misspellings, the WordPiece and word embedding models degrade, with performance declines of 2%-14% and 5%-19%, respectively. Meanwhile, the character trigram models remain comparatively stable, with drops of 0%-8%. Notably, the performance of the WordPiece and word embedding models suffers due to their inability to recognize misspelled words.
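To make the robustness argument concrete, the following is a minimal sketch of character trigram tokenization and of how a single misspelling affects the OOV rate at the word level versus the trigram level. It is an illustration only, not the authors' pipeline: the abstract does not say how trigrams are extracted, so the per-word extraction, the `#` boundary padding, the helper names `char_trigrams` and `oov_rate`, and the two toy sentences are all assumptions.

```python
def char_trigrams(text, pad="#"):
    """Split each whitespace-delimited word into overlapping character trigrams.

    Assumption: trigrams are taken per word, and each word is padded with a
    boundary marker on both sides. The paper may use a different scheme.
    """
    grams = []
    for word in text.split():
        padded = f"{pad}{word}{pad}"
        grams.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur among the training tokens."""
    vocab = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for token in test_tokens if token not in vocab)
    return unseen / len(test_tokens)


# Toy data (hypothetical): the test sentence drops one letter from the first word.
train_text = "الفيلم جميل جدا"
test_text = "الفلم جميل جدا"

# Word level: the misspelled word is entirely unknown.
word_oov = oov_rate(train_text.split(), test_text.split())

# Trigram level: only the trigrams overlapping the missing letter are unknown.
trigram_oov = oov_rate(char_trigrams(train_text), char_trigrams(test_text))

print(f"word OOV rate:    {word_oov:.2f}")
print(f"trigram OOV rate: {trigram_oov:.2f}")
```

In this toy case the misspelled word is lost completely at the word level, while most of its trigrams still match ones seen in training, which is the intuition behind the smaller performance drop reported for the character trigram models.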
computer science, information systems, telecommunications, engineering, electrical & electronic