Abstract: While resources for understanding English-language content on social media are fairly sufficient, comparable resources for Arabic are still immature. The main reason is that Arabic has many dialects in addition to its standard version, Modern Standard Arabic (MSA). Arabs do not use MSA in their daily communication; rather, they use dialectal versions. Social media users carry this practice over to social media platforms, which has raised an urgent need for suitable AI models for language-dependent applications. Existing machine translation (MT) systems designed for MSA fail to work well with Arabic dialects. It is therefore necessary to adapt to the informal nature of communication on social networks by developing MT systems that can effectively handle the various dialects of Arabic. Unlike MSA, for which MT systems have made advanced progress, Arabic dialects have received little attention in MT. While a few attempts have been made to build translation datasets for dialectal Arabic, they are domain dependent and are not suited to the cultural and linguistic conventions of online social networks (OSNs). In this work, we attempt to alleviate these limitations by proposing an OSN-based multidialect Arabic dataset crafted by contextually translating English tweets into four Arabic dialects: Gulf, Yemeni, Iraqi, and Levantine. To perform the translation, we followed our proposed guideline framework for content translation, which could be universally applicable for translation between foreign languages and local dialects. We validated the authenticity of our proposed dataset by developing neural MT (NMT) models for the four Arabic dialects. Our results show superior performance for the NMT models trained on our dataset. We believe that our dataset can reliably serve as an Arabic multidialectal translation dataset for informal MT tasks.

OSN-MDAD: Machine Translation Dataset for Arabic Multi-Dialectal Conversations on Online Social Media
