ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation

Mohammed Khalil, Mohammed Sabry
2024-07-29
Abstract: Classical Arabic represents a significant era, encompassing the golden age of Arab culture, philosophy, and scientific literature. With a broad consensus on the importance of translating these literatures to enrich knowledge dissemination across communities, the advent of large language models (LLMs) and translation systems offers promising tools to facilitate this goal. However, we have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics, hindering the development of high-quality translation systems. In response, we present the ATHAR dataset, comprising 66,000 high-quality Classical Arabic to English translation samples that cover a wide array of subjects including science, culture, and philosophy. Furthermore, we assess the performance of current state-of-the-art LLMs under various settings, concluding that there is a need for such datasets in current systems. Our findings highlight how models can benefit from fine-tuning or incorporating this dataset into their pretraining pipelines. The dataset is publicly available on the HuggingFace Data Hub at https://huggingface.co/datasets/mohamed-khalil/ATHAR.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper mainly addresses the following points:

1. **Background and Challenges**:
   - **Challenges in Classical Arabic Translation**: Classical Arabic differs significantly from Modern Standard Arabic (MSA) in grammar and vocabulary, which makes translation harder.
   - **Limitations of Existing Datasets**: Existing datasets for machine translation of Classical Arabic are few and often focus on religious texts or specific themes, so they lack diversity and representativeness.
2. **Construction of the ATHAR Dataset**:
   - **Objective**: To develop a high-quality, diverse Classical Arabic to English translation dataset that supports a wider range of translation tasks.
   - **Content**: Contains 66,000 translation samples covering fields such as science, culture, and philosophy.
   - **Sources**: Collected from multiple classical works on the Rasaif website, including historical documents and philosophical works.
   - **Preprocessing**: Data cleaning, language identification, and alignment verification were performed to ensure data quality.
3. **Evaluation and Experiments**:
   - **Evaluation Methods**: Large language models (LLMs) were evaluated in zero-shot, few-shot, and fine-tuning settings.
   - **Model Selection**: GPT-4o, Llama-3 70B, Llama-3 8B, and Llama-2 7B were selected for evaluation.
   - **Results**: Performance differed significantly across settings. Fine-tuning, especially LoRA fine-tuning, can significantly improve the translation performance of the models.
4. **Conclusion and Outlook**:
   - **Importance**: The ATHAR dataset is crucial for improving the quality of Classical Arabic translation systems.
   - **Future Work**: The authors plan to expand the dataset with more text types and topics to further improve translation quality.
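The preprocessing steps named above (data cleaning, language identification, alignment verification) can be illustrated with a minimal sketch. This is not the authors' pipeline, which is not detailed here: the function names, the script-ratio heuristic for language identification, the word-count ratio used as an alignment check, and both thresholds are assumptions chosen only for illustration.

```python
# Hypothetical sketch of a cleaning / language-check / alignment filter
# for Arabic-English pairs. Heuristics and thresholds are assumptions.

# Main Unicode blocks for Arabic letters (Arabic, Arabic Supplement,
# Arabic Extended-A).
ARABIC_RANGES = ((0x0600, 0x06FF), (0x0750, 0x077F), (0x08A0, 0x08FF))


def clean(text: str) -> str:
    """Remove tatweel (U+0640) and collapse runs of whitespace."""
    return " ".join(text.replace("\u0640", "").split())


def arabic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that lie in Arabic blocks."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(
        1
        for ch in letters
        if any(lo <= ord(ch) <= hi for lo, hi in ARABIC_RANGES)
    )
    return arabic / len(letters)


def keep_pair(ar: str, en: str,
              min_script: float = 0.8,
              max_len_ratio: float = 3.0) -> bool:
    """Return True if the pair passes all three illustrative checks."""
    ar, en = clean(ar), clean(en)
    if not ar or not en:
        return False
    # Language identification by script: source mostly Arabic letters,
    # target mostly non-Arabic letters.
    if arabic_ratio(ar) < min_script:
        return False
    if arabic_ratio(en) > 1.0 - min_script:
        return False
    # Crude alignment check: word counts should be within a fixed ratio.
    ratio = len(ar.split()) / len(en.split())
    return 1.0 / max_len_ratio <= ratio <= max_len_ratio
```

Applied to a list of `(arabic, english)` tuples, `[p for p in pairs if keep_pair(*p)]` would drop empty, swapped, or badly length-mismatched pairs. A real pipeline would more likely use a trained language identifier and a stronger alignment measure.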
In summary, this paper aims to fill the gap in the field of Classical Arabic translation by constructing the ATHAR dataset and demonstrates through experiments the potential of this dataset in improving translation system performance.
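The zero-shot and few-shot settings used in the evaluation differ only in whether example translation pairs are placed in the prompt ahead of the text to translate. The sketch below shows one way such prompts could be assembled. The instruction wording, the `Arabic:`/`English:` labels, and the function name `build_prompt` are hypothetical and are not taken from the paper.

```python
# Hypothetical prompt builder for zero-shot and few-shot translation.
# The template is an assumption, not the prompt used by the authors.
from typing import Iterable, Tuple


def build_prompt(source: str,
                 examples: Iterable[Tuple[str, str]] = ()) -> str:
    """Build a translation prompt.

    With no examples this is a zero-shot prompt; with k example pairs
    it is a k-shot prompt.
    """
    lines = ["Translate the following Classical Arabic text into English."]
    for ar, en in examples:
        # Each demonstration shows a source text and its reference translation.
        lines.append(f"Arabic: {ar}")
        lines.append(f"English: {en}")
    # The final block leaves the English side open for the model to complete.
    lines.append(f"Arabic: {source}")
    lines.append("English:")
    return "\n".join(lines)
```

Calling `build_prompt(text)` gives the zero-shot form, while `build_prompt(text, examples=pairs[:3])` gives a three-shot form. In practice the demonstrations would be drawn from a training split so that they do not overlap with the evaluated samples.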