Abstract: While English-language resources for understanding social media content are fairly mature, comparable resources for Arabic remain immature. The main reason is that Arabic has many dialects in addition to its standard form, Modern Standard Arabic (MSA). Arabs do not use MSA in their daily communication; rather, they use dialectal varieties. Social media users carry this practice over to social media platforms, which has created an urgent need for suitable AI models for language-dependent applications. Existing machine translation (MT) systems designed for MSA do not work well with Arabic dialects. It is therefore necessary to adapt to the informal nature of communication on social networks by developing MT systems that can effectively handle the various dialects of Arabic. Whereas MT for MSA has made considerable progress, little effort has been devoted to MT for Arabic dialects. Although a few attempts have been made to build translation datasets for dialectal Arabic, they are domain dependent and are not suited to the culture and language of online social networks (OSNs). In this work, we attempt to alleviate these limitations by proposing an OSN-based multidialect Arabic dataset, crafted by contextually translating English tweets into four Arabic dialects: Gulf, Yemeni, Iraqi, and Levantine. To perform the translation, we followed our proposed guideline framework for content translation, which could be universally applicable to translation between foreign languages and local dialects. We validated the authenticity of our proposed dataset by developing neural MT (NMT) models for the four Arabic dialects. Our results show superior performance for the NMT models trained on our dataset. We believe that our dataset can reliably serve as an Arabic multidialectal translation dataset for informal MT tasks.

OSN-MDAD: Machine Translation Dataset for Arabic Multi-Dialectal Conversations on Online Social Media
