Improving Natural Language Inference in Arabic using Transformer Models and Linguistically Informed Pre-Training

Mohammad Majd Saad Al Deen, Maren Pielka, Jörn Hees, Bouthaina Soulef Abdou, Rafet Sifa
2023-07-27
Abstract: This paper addresses the classification of Arabic text data in the field of Natural Language Processing (NLP), with a particular focus on Natural Language Inference (NLI) and Contradiction Detection (CD). Arabic is considered a resource-poor language, meaning that there are few data sets available, which leads to limited availability of NLP methods. To overcome this limitation, we create a dedicated data set from publicly available resources. Subsequently, transformer-based machine learning models are trained and evaluated. We find that a language-specific model (AraBERT) performs competitively with state-of-the-art multilingual approaches when we apply linguistically informed pre-training methods such as Named Entity Recognition (NER). To our knowledge, this is the first large-scale evaluation for this task in Arabic, as well as the first application of multi-task pre-training in this context.
Computation and Language
What problem does this paper attempt to address?
The paper addresses Natural Language Inference (NLI) and Contradiction Detection (CD) for Arabic text. Because Arabic is a low-resource language with few available datasets, the development of NLP methods for it has been limited. To tackle this problem, the researchers built a dedicated dataset from publicly available resources and used it to train and evaluate Transformer-based models, specifically AraBERT (a language-specific model) and XLM-RoBERTa (a multilingual model). They found that with linguistically informed pre-training, such as Named Entity Recognition (NER), AraBERT performs competitively with state-of-the-art multilingual approaches. To the authors' knowledge, this is the first large-scale evaluation of NLI and CD in Arabic and the first application of multi-task pre-training in this context. The experimental results suggest that a model pre-trained for a specific language (such as AraBERT) can match, and in certain tasks even outperform, an extensively multilingual pre-trained model (such as XLM-RoBERTa).
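To make the two task formulations concrete, the following is a minimal sketch of how a premise/hypothesis pair might be prepared for a BERT-style classifier and how three-way NLI labels could be collapsed into binary CD labels. It is illustrative only: the label names, the `SentencePair` class, the helper functions, the `[CLS]`/`[SEP]` marker strings, and the English example sentences are assumptions made here, not details taken from the paper or its data set.

```python
from dataclasses import dataclass

# Assumed three-way NLI label set (a common convention, not confirmed by the paper).
NLI_LABELS = ("entailment", "neutral", "contradiction")


@dataclass(frozen=True)
class SentencePair:
    """One hypothetical NLI example: a premise, a hypothesis, and a label."""
    premise: str
    hypothesis: str
    nli_label: str


def to_cd_label(nli_label: str) -> int:
    """Collapse a three-way NLI label into a binary CD label.

    Returns 1 for "contradiction" and 0 otherwise. This mapping is an
    assumption about how CD relates to NLI.
    """
    if nli_label not in NLI_LABELS:
        raise ValueError(f"unknown NLI label: {nli_label!r}")
    return 1 if nli_label == "contradiction" else 0


def encode_pair(pair: SentencePair, cls: str = "[CLS]", sep: str = "[SEP]") -> str:
    """Join premise and hypothesis in the usual BERT-style pair layout.

    A real tokenizer would insert these markers itself; they are written
    out here only to show the structure of the model input.
    """
    return f"{cls} {pair.premise} {sep} {pair.hypothesis} {sep}"


if __name__ == "__main__":
    example = SentencePair(
        premise="The museum is open every day.",
        hypothesis="The museum is closed on Mondays.",
        nli_label="contradiction",
    )
    print(encode_pair(example))
    print(to_cd_label(example.nli_label))
```

In practice, a tokenizer for AraBERT or XLM-RoBERTa would handle the pair encoding and use its own special tokens, so `encode_pair` only illustrates the input layout.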