EgyBERT: A Large Language Model Pretrained on Egyptian Dialect Corpora

Faisal Qarah
2024-08-07
Abstract: This study presents EgyBERT, an Arabic language model pretrained on 10.4 GB of Egyptian dialectal texts. We evaluated EgyBERT's performance by comparing it with five other multidialect Arabic language models across 10 evaluation datasets. EgyBERT achieved the highest average F1-score of 84.25% and an accuracy of 87.33%, significantly outperforming all other compared models, with MARBERTv2 as the second-best model, achieving an F1-score of 83.68% and an accuracy of 87.19%. Additionally, we introduce two novel Egyptian dialectal corpora: the Egyptian Tweets Corpus (ETC), containing over 34.33 million tweets (24.89 million sentences) amounting to 2.5 GB of text, and the Egyptian Forums Corpus (EFC), comprising over 44.42 million sentences (7.9 GB of text) collected from various Egyptian online forums. Both corpora are used in pretraining the new model, and they are the largest Egyptian dialectal corpora reported in the literature to date. Furthermore, this is the first study to evaluate the performance of various language models on Egyptian dialect datasets, revealing significant differences in performance that highlight the need for more dialect-specific models. The results confirm the effectiveness of the EgyBERT model in processing and analyzing Arabic text expressed in the Egyptian dialect, surpassing the other language models included in the study. The EgyBERT model is publicly available at https://huggingface.co/faisalq/EgyBERT.
Computation and Language
What problem does this paper attempt to address?
The paper addresses the lack of a large language model built specifically for the Egyptian dialect, and presents EgyBERT to improve the processing and analysis of Egyptian Arabic text. Its main objectives are:

1. **Building large Egyptian dialect corpora**: The paper introduces two new corpora: the Egyptian Tweets Corpus (ETC), containing over 34.33 million tweets (24.89 million sentences, 2.5 GB of text), and the Egyptian Forums Corpus (EFC), containing over 44.42 million sentences (7.9 GB of text). The authors report these as the largest Egyptian dialect corpora in the literature to date.
2. **Pretraining the EgyBERT model**: Both corpora, 10.4 GB of text in total, are used to pretrain EgyBERT so that it handles Egyptian dialect text better than general-purpose Arabic models.
3. **Evaluating model performance**: EgyBERT is compared with five multidialect Arabic language models on 10 evaluation datasets. These datasets cover tasks such as sentiment analysis, text classification, sarcasm detection, and gender identification.
4. **Filling the research gap**: Despite numerous studies on Arabic natural language processing, models that specifically target the Egyptian dialect are still lacking. EgyBERT aims to fill this gap.

Across the 10 datasets, EgyBERT achieves the highest average F1-score (84.25%) and accuracy (87.33%) among the compared models, ahead of the second-best model, MARBERTv2 (83.68% F1-score, 87.19% accuracy).
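
The headline figures above are averages over several datasets. As an illustration of that kind of aggregation, here is a minimal sketch that computes a per-dataset F1-score and accuracy and then averages them. It assumes macro-averaged F1, which the text above does not specify. The function names and the toy labels are hypothetical and are not taken from the paper or its datasets.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def average_over_datasets(results):
    """results maps a dataset name to a (y_true, y_pred) pair."""
    f1s = [macro_f1(t, p) for t, p in results.values()]
    accs = [accuracy(t, p) for t, p in results.values()]
    return sum(f1s) / len(f1s), sum(accs) / len(accs)


# Hypothetical toy predictions for two made-up datasets (not real results).
toy_results = {
    "toy_sentiment": (["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"]),
    "toy_sarcasm": ([1, 0, 1, 0], [1, 0, 1, 0]),
}
avg_f1, avg_acc = average_over_datasets(toy_results)
print(f"average F1 = {avg_f1:.4f}, average accuracy = {avg_acc:.4f}")
```

Averaging per-dataset scores this way gives every dataset equal weight regardless of its size. Pooling all examples before scoring would weight larger datasets more heavily and would generally give different numbers.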