ArEntail: manually-curated Arabic natural language inference dataset from news headlines
Rasha Obeidat,Yara Al-Harahsheh,Mahmoud Al-Ayyoub,Maram Gharaibeh
DOI: https://doi.org/10.1007/s10579-024-09731-1
2024-04-24
Language Resources and Evaluation
Abstract: Natural language inference (NLI), also known as textual entailment recognition (TER), is a crucial task in natural language processing that combines many fundamental aspects of language understanding. Despite significant recent advances in NLI, primarily driven by the development of diverse large-scale datasets, most of the progress has been confined to English. This is attributed to the scarcity of human-annotated corpora for most other languages, notably Arabic. In this paper, we present an Arabic NLI dataset called ArEntail, consisting of 6000 sentence pairs collected from news headlines and manually labeled to indicate whether an entailment relationship holds between the sentences, without resorting to machine translation from English datasets. To our knowledge, this is the largest human-crafted NLI dataset for the Arabic language to date. We offer various data analyses and establish baseline results using state-of-the-art pre-trained models for Arabic, in addition to a human-based evaluation. Our findings show that AraBERT-base v2, the best-performing model, achieves an accuracy of 93%, leaving a gap of 2.6% compared to human performance and an opportunity for further advances in modeling techniques in future research. Moreover, the performance of the "hypothesis-only" baseline closely resembles that of a random guesser, indicating that annotation artifacts are rare compared to prior English NLI benchmarks. We also evaluated GPT-3.5-turbo in zero-shot and few-shot Arabic NLI scenarios and observed promising results, with the model taking a cautious approach and predicting an entailment relationship only when strong clues are present.
computer science, interdisciplinary applications