Abstract: This data paper introduces a dataset for word sense disambiguation covering one hundred polysemous words frequently used in Modern Standard Arabic. Each word has between 3 and 8 senses, for a total of 367 unique senses. Each sense is accompanied by ten example sentences that use the polysemous word in context, giving 3670 samples in total. The dataset is in Arabic, a language known for its rich morphology, complex syntax, and extensive polysemy. The data was collected between 1 April 2023 and 1 May 2023 from various web sources spanning news, medicine, finance, and other domains. This breadth makes the dataset applicable across diverse fields and a useful resource for Arabic Natural Language Processing (NLP) applications. The dataset covers all senses of each frequently used polysemous word, including rare senses that seldom appear in real-world contexts, thereby mitigating bias. Collection proceeded in stages: initial web scraping, manual sorting of the scraped sentences by word sense, and further searches by a human expert to fill in missing contextual sentences. Where online data for a rare sense was still lacking or insufficient, synthetic sentences were generated with GPT3.5-turbo. Beyond word sense disambiguation, the dataset may also be of value to researchers in other areas, such as sentiment analysis.
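The figures stated above (3 to 8 senses per word, ten sentences per sense, 367 senses, 3670 samples) can be checked mechanically. The following is a minimal sketch, assuming a hypothetical record layout of `(word, sense, sentence)` tuples; the published files may be organized differently, and `check_shape` is an illustrative name, not part of the dataset's tooling.

```python
from collections import defaultdict

# Figures stated in the abstract.
SENTENCES_PER_SENSE = 10
MIN_SENSES, MAX_SENSES = 3, 8


def check_shape(records):
    """Return a list of problems found in (word, sense, sentence) records.

    The record layout is an assumption made for illustration only.
    """
    per_sense = defaultdict(int)  # (word, sense) -> number of sentences
    senses = defaultdict(set)     # word -> set of sense labels
    for word, sense, _sentence in records:
        per_sense[(word, sense)] += 1
        senses[word].add(sense)

    problems = []
    # Every word should have between 3 and 8 senses.
    for word, sense_set in senses.items():
        if not MIN_SENSES <= len(sense_set) <= MAX_SENSES:
            problems.append(f"{word}: {len(sense_set)} senses")
    # Every sense should have exactly ten example sentences.
    for (word, sense), n in per_sense.items():
        if n != SENTENCES_PER_SENSE:
            problems.append(f"{word}/{sense}: {n} sentences")
    return problems


# The stated totals are mutually consistent: 367 senses x 10 sentences each.
assert 367 * SENTENCES_PER_SENSE == 3670
```

An empty list from `check_shape` means the records match the stated structure; any entry names the word or sense that deviates.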
