EgyBERT: A Large Language Model Pretrained on Egyptian Dialect Corpora

Faisal Qarah

2024-08-07

Abstract: This study presents EgyBERT, an Arabic language model pretrained on 10.4 GB of Egyptian dialectal texts. We evaluated EgyBERT's performance by comparing it with five other multidialect Arabic language models across 10 evaluation datasets. EgyBERT achieved the highest average F1-score of 84.25% and an accuracy of 87.33%, significantly outperforming all other comparative models, with MARBERTv2 as the second-best model, achieving an F1-score of 83.68% and an accuracy of 87.19%. Additionally, we introduce two novel Egyptian dialectal corpora: the Egyptian Tweets Corpus (ETC), containing over 34.33 million tweets (24.89 million sentences) amounting to 2.5 GB of text, and the Egyptian Forums Corpus (EFC), comprising over 44.42 million sentences (7.9 GB of text) collected from various Egyptian online forums. Both corpora are used in pretraining the new model, and they are the largest Egyptian dialectal corpora reported in the literature to date. Furthermore, this is the first study to evaluate the performance of various language models on Egyptian dialect datasets, revealing significant differences in performance that highlight the need for more dialect-specific models. The results confirm the effectiveness of the EgyBERT model in processing and analyzing Arabic text expressed in Egyptian dialect, surpassing the other language models included in the study. The EgyBERT model is publicly available at https://huggingface.co/faisalq/EgyBERT.

Computation and Language

What problem does this paper attempt to address?

The paper addresses the lack of a large language model built specifically for the Egyptian dialect. It develops EgyBERT to improve the processing and analysis of Arabic text written in Egyptian dialect. Its main objectives are:

1. **Building large Egyptian dialect corpora**: The paper introduces two new corpora, the Egyptian Tweets Corpus (ETC) and the Egyptian Forums Corpus (EFC). ETC contains over 34.33 million tweets (24.89 million sentences, 2.5 GB of text), and EFC contains over 44.42 million sentences (7.9 GB of text). The authors report them as the largest Egyptian dialect corpora in the literature to date.
2. **Pre-training the EgyBERT model**: Both corpora are used to pre-train EgyBERT, for a total of 10.4 GB of Egyptian dialectal text.
3. **Evaluating model performance**: EgyBERT is compared with five multi-dialect Arabic language models on 10 evaluation datasets. These datasets cover tasks such as sentiment analysis, text classification, sarcasm detection, and gender identification.
4. **Filling the research gap**: Despite numerous studies on Arabic natural language processing, models that specifically target the Egyptian dialect are still lacking. EgyBERT aims to fill this gap.

Across the 10 datasets, EgyBERT achieves the highest average F1-score (84.25%) and accuracy (87.33%), ahead of the second-best model, MARBERTv2 (83.68% F1-score, 87.19% accuracy).
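The headline numbers are averages over the 10 evaluation datasets. As a rough illustration of how such an average can be computed, the following sketch scores each dataset separately and then takes an unweighted mean. The macro-averaged F1 variant, the unweighted mean across datasets, the function names, and the toy labels are all assumptions made for illustration; the summary above does not state which F1 variant or averaging scheme the paper uses.

```python
from statistics import mean


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed variant, not confirmed by the source)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return mean(per_class)


def average_over_datasets(results):
    """results maps a dataset name to a (gold labels, predicted labels) pair."""
    f1_scores = [macro_f1(t, p) for t, p in results.values()]
    accuracies = [accuracy(t, p) for t, p in results.values()]
    return mean(f1_scores), mean(accuracies)


# Toy predictions for two hypothetical datasets (not data from the paper).
toy_results = {
    "sentiment": ([1, 0, 1, 1], [1, 0, 0, 1]),
    "sarcasm": (["pos", "neg"], ["pos", "neg"]),
}
avg_f1, avg_acc = average_over_datasets(toy_results)
print(f"average F1: {avg_f1:.4f}, average accuracy: {avg_acc:.4f}")
```

Because each dataset contributes equally under this scheme, a small dataset moves the average as much as a large one, which is one reason the averaging choice matters when models are separated by less than a percentage point.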
