Abstract: The third wave started in 2010, when research on Arabic NLP returned to the Arab world. This period witnessed a proliferation of Arab researchers and graduate students interested in Arabic NLP and an increase in publications from the Arab world in top conferences. Active universities include New York University Abu Dhabi (NYUAD), American University of Beirut (AUB), Carnegie Mellon University in Qatar (CMUQ), King Saud University (KSU), Birzeit University (BZU), Cairo University, and others. Active research centers include Qatar Computing Research Institute (QCRI), King Abdulaziz City for Science and Technology (KACST), and more. Many researchers in smaller groups across the Arab world are also actively contributing. This period also overlapped with two major independent developments: the rise of deep learning and neural models, and the rise of social media. The first pushed research further into the ML space; the second increased the volume of social media data, which introduced many new challenges at a larger scale: more dialects and more noise. This period also witnessed a welcome increase in Arabic language resources and processing tools, and a heightened awareness of the importance of AI for the future of the region; for example, the UAE now has a ministry specifically for AI. Finally, new young and ambitious companies such as Mawdoo3 are competing in a growing market with growing expectations in the Arab world.

In the Arab world, efforts to create annotated corpora are relatively limited. Examples include BZU's Curras, the Palestinian Arabic annotated corpus; NYUAD's Gumar, the Emirati Arabic annotated corpus; and the Al-Mus'haf Quranic Arabic corpus. Another annotation effort, focused on MSA spelling and grammar correction, is the Qatar Arabic Language Bank (QALB), a project involving Columbia and CMUQ.
Other specialized annotated corpora developed in the Arab world include NYUAD's parallel gender corpus, with sentences in masculine and feminine forms for anti-gender bias research; the Arab-Acquis corpus, pairing Arabic with all of Europe's languages for a portion of European parliamentary proceedings; and the MADAR corpus of parallel dialects, created in collaboration with CMUQ.

Although some progress has been made for both L1 and L2 pedagogical applications (PA), the dearth of resources compared with English remains the bottleneck for future progress. Resource-building efforts have focused on L1 readers, with particular emphasis on grade school curricula. There is a push to inform the enhancement of curricula using pedagogical tools and to compare curricula across Arab countries. The L2 PAs are even more constrained, with limited corpora and a disproportionate focus on beginners. There is a definite need for augmenting these corpora in a reasoned way, taking into consideration different text features and learners, both young and old, beefing up the sparsely populated levels with authentic material, and exploiting technologies such as text simplification and text error analysis and correction. Learner corpora, which, as the name suggests, are produced by learners of Arabic, can inform the creation of tools and corpora. A recent effort developed a large-scale Arabic readability lexicon compatible with an existing morphological analysis system.

Another information retrieval-related problem is question answering, which comes in many flavors; the most common attempts to identify a passage or a sentence that answers a question. Such a task may draw on a large set of NLP tools such as parsing, NER, co-reference resolution, and text semantic representation. There has been limited research on this problem, and existing commercial solutions such as Ujeeb.com are rudimentary.
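
As a rough illustration of the most common flavor of question answering described above (picking the passage most likely to answer a question), the following minimal sketch ranks candidate passages by TF-IDF cosine similarity to the question. It is not the method of any system mentioned here: the function names `tokenize` and `rank_passages`, the toy English passages, and the smoothed IDF formula are all assumptions made for illustration, and a realistic Arabic pipeline would likely add orthographic normalization, morphological analysis, NER, and the other tools listed above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude word tokenizer (assumption). Real Arabic text would normally
    # need normalization and morphological analysis before matching.
    return re.findall(r"\w+", text.lower())


def rank_passages(question, passages):
    """Return (score, passage) pairs sorted by TF-IDF cosine similarity."""
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)
    # Document frequency: in how many passages each term occurs.
    df = Counter(term for doc in docs for term in doc)

    def weights(counts):
        # Smoothed IDF, so terms found in every passage keep a small weight.
        return {
            t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in counts.items()
        }

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q = weights(Counter(tokenize(question)))
    scored = [(cosine(q, weights(d)), p) for d, p in zip(docs, passages)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy data, in English for readability.
passages = [
    "Cairo is the capital of Egypt and its largest city.",
    "The Nile flows north through Sudan and Egypt.",
    "Beirut is the capital of Lebanon.",
]
question = "What is the capital of Lebanon?"
ranking = rank_passages(question, passages)
best_score, best_passage = ranking[0]
```

Lexical overlap of this kind only locates a likely passage; it does not extract the answer span or handle paraphrase, which is where parsing, co-reference resolution, and semantic representations would come in.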
