ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation

Mohammed Khalil, Mohammed Sabry

2024-07-29

Abstract: Classical Arabic represents a significant era, encompassing the golden age of Arab culture, philosophy, and scientific literature. With a broad consensus on the importance of translating these literatures to enrich knowledge dissemination across communities, the advent of large language models (LLMs) and translation systems offers promising tools to facilitate this goal. However, we have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics, hindering the development of high-quality translation systems. In response, we present the ATHAR dataset, comprising 66,000 high-quality Classical Arabic to English translation samples that cover a wide array of subjects including science, culture, and philosophy. Furthermore, we assess the performance of current state-of-the-art LLMs under various settings, concluding that there is a need for such datasets in current systems. Our findings highlight how models can benefit from fine-tuning or incorporating this dataset into their pretraining pipelines. The dataset is publicly available on the HuggingFace Data Hub at https://huggingface.co/datasets/mohamed-khalil/ATHAR.
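
As a rough illustration of how the published dataset might be accessed, the sketch below uses the `load_dataset` function from the Hugging Face `datasets` library with the repository ID taken from the URL above. The helper names `load_athar` and `summarize` are hypothetical. The abstract does not state the split names or column names, so the code avoids assuming them and only reports row counts per split.

```python
def load_athar(repo_id: str = "mohamed-khalil/ATHAR"):
    """Download the dataset from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the helpers below can be used without the library.
    from datasets import load_dataset  # assumes `pip install datasets`

    # Without a `split` argument this returns a dict-like object of splits.
    return load_dataset(repo_id)


def summarize(dataset_dict) -> dict:
    """Map each split name to its number of rows."""
    return {name: len(split) for name, split in dataset_dict.items()}
```

Calling `summarize(load_athar())` would then show which splits exist and how many samples each contains, and indexing the first row of any split would reveal the actual column names.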

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper mainly addresses the following issues:

1. **Background and Challenges**:
   - **Challenges in Classical Arabic Translation**: Classical Arabic differs significantly from Modern Standard Arabic (MSA) in grammar and vocabulary, which makes translation harder.
   - **Limitations of Existing Datasets**: Existing machine translation datasets for Classical Arabic are few and often focus on religious texts or specific themes, so they lack diversity and representativeness.
2. **Construction of the ATHAR Dataset**:
   - **Objective**: To develop a high-quality, diverse Classical Arabic to English translation dataset that supports a wider range of translation tasks.
   - **Content**: Contains 66,000 translation samples covering fields such as science, culture, and philosophy.
   - **Sources**: Collected from multiple classical works on the Rasaif website, including historical documents and philosophical works.
   - **Preprocessing**: Data cleaning, language identification, and alignment verification were performed to ensure data quality.
3. **Evaluation and Experiments**:
   - **Evaluation Methods**: Zero-shot, few-shot, and fine-tuning settings were used to evaluate the performance of large language models (LLMs).
   - **Model Selection**: GPT-4o, Llama-3 70B, Llama-3 8B, and Llama-2 7B were selected for evaluation.
   - **Results**: Performance differed significantly across settings. Fine-tuning, especially LoRA fine-tuning, can significantly improve the translation performance of the models.
4. **Conclusion and Outlook**:
   - **Importance**: The ATHAR dataset is crucial for improving the quality of Classical Arabic translation systems.
   - **Future Work**: The authors plan to expand the dataset with more types of texts and topics to further improve translation quality.

In summary, this paper aims to fill the gap in the field of Classical Arabic translation by constructing the ATHAR dataset and demonstrates through experiments the potential of this dataset in improving translation system performance.
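
To make the preprocessing checks and the zero-shot/few-shot settings mentioned above more concrete, here is a minimal sketch. It is not the authors' pipeline: the script-based check is a crude stand-in for real language identification, the word-count ratio is a crude stand-in for alignment verification, and all function names, thresholds, and the prompt wording are assumptions made for illustration.

```python
import re
import unicodedata

# Basic Arabic Unicode block only; a real pipeline would use a language-ID model.
ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")
LATIN_CHARS = re.compile(r"[A-Za-z]")


def clean(text: str) -> str:
    """Normalize Unicode (NFC) and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def script_ratio(text: str, pattern: re.Pattern) -> float:
    """Fraction of alphabetic characters in `text` that match `pattern`."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if pattern.match(ch)) / len(letters)


def keep_pair(arabic: str, english: str,
              min_script: float = 0.8, max_len_ratio: float = 4.0) -> bool:
    """Hypothetical filter: per-side script check plus a word-count ratio check."""
    if script_ratio(arabic, ARABIC_CHARS) < min_script:
        return False
    if script_ratio(english, LATIN_CHARS) < min_script:
        return False
    # Both sides contain letters at this point, so both word counts are >= 1.
    n_ar, n_en = len(arabic.split()), len(english.split())
    return max(n_ar, n_en) / min(n_ar, n_en) <= max_len_ratio


def build_few_shot_prompt(examples, source: str) -> str:
    """Build a translation prompt; an empty `examples` list gives a zero-shot prompt."""
    lines = ["Translate the following Classical Arabic text into English.", ""]
    for arabic, english in examples:
        lines += [f"Arabic: {arabic}", f"English: {english}", ""]
    lines += [f"Arabic: {source}", "English:"]
    return "\n".join(lines)
```

The same prompt builder covers both settings: passing a few cleaned, filtered pairs as `examples` yields a few-shot prompt, while passing an empty list yields the zero-shot case.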
