Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi

Benjamin Muller, Benoit Sagot, Djamé Seddah

DOI: https://doi.org/10.48550/arXiv.2005.00318

2020-05-01

Abstract: Building natural language processing systems for non standardized and low resource languages is a difficult challenge. The recent success of large-scale multilingual pretrained language models provides new modeling tools to tackle this. In this work, we study the ability of multilingual language models to process an unseen dialect. We take user generated North-African Arabic as our case study, a resource-poor dialectal variety of Arabic with frequent code-mixing with French and written in Arabizi, a non-standardized transliteration of Arabic to Latin script. Focusing on two tasks, part-of-speech tagging and dependency parsing, we show in zero-shot and unsupervised adaptation scenarios that multilingual language models are able to transfer to such an unseen dialect, specifically in two extreme cases: (i) across scripts, using Modern Standard Arabic as a source language, and (ii) from a distantly related language, unseen during pretraining, namely Maltese. Our results constitute the first successful transfer experiments on this dialect, paving thus the way for the development of an NLP ecosystem for resource-scarce, non-standardized and highly variable vernacular languages.

Computation and Language, Machine Learning

What problem does this paper attempt to address?

This paper asks whether multilingual language models can process a dialect they have never seen. Taking North African Arabizi (NArabizi) as a case study, the authors test whether multilingual language models such as mBERT can transfer knowledge from languages they know to this unseen, resource-poor dialect in zero-shot and unsupervised adaptation scenarios. NArabizi is dialectal Arabic, widely used in Algeria, that is frequently code-mixed with French and written in Latin script. It has no standard spelling or transliteration rules, which makes it very challenging for natural language processing. The paper evaluates cross-lingual transfer on two tasks, part-of-speech tagging and dependency parsing, focusing on two extreme cases: transfer across scripts from Modern Standard Arabic, and transfer from Maltese, a language distantly related to North African Arabic that was not seen during pretraining. The results pave the way for a natural language processing ecosystem for low-resource, non-standardized, and highly variable dialects.
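To make the two evaluation tasks concrete, the following is a minimal sketch of how predictions for part-of-speech tagging and dependency parsing are commonly scored. It assumes the standard metrics of tagging accuracy, unlabeled attachment score (UAS), and labeled attachment score (LAS); the summary above does not state which metrics the paper reports. The `score` function, the `Token` layout, and the toy tags are hypothetical and are not taken from the paper or its data.

```python
from typing import Dict, List, Tuple

# Hypothetical token layout: (POS tag, head index, dependency label).
# Head index 0 denotes the root, as in CoNLL-U style annotations.
Token = Tuple[str, int, str]


def score(gold: List[Token], pred: List[Token]) -> Dict[str, float]:
    """Score one aligned sequence of gold and predicted tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be aligned")
    if not gold:
        raise ValueError("cannot score an empty sequence")

    n = len(gold)
    # POS accuracy: share of tokens with the correct tag.
    pos_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # UAS: share of tokens attached to the correct head.
    uas_hits = sum(g[1] == p[1] for g, p in zip(gold, pred))
    # LAS: share of tokens with both the correct head and the correct label.
    las_hits = sum(g[1] == p[1] and g[2] == p[2] for g, p in zip(gold, pred))

    return {
        "pos_accuracy": pos_hits / n,
        "uas": uas_hits / n,
        "las": las_hits / n,
    }


# Toy four-token example (invented annotations, for illustration only).
gold = [("PRON", 2, "nsubj"), ("VERB", 0, "root"), ("DET", 4, "det"), ("NOUN", 2, "obj")]
pred = [("PRON", 2, "nsubj"), ("VERB", 0, "root"), ("NOUN", 4, "nmod"), ("NOUN", 3, "obj")]

result = score(gold, pred)
print(result)
```

In a zero-shot setting, a tagger or parser would be trained only on the source language (for example Modern Standard Arabic or Maltese) and then scored in this way on the target dialect without seeing any labeled target data.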

Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?

Low Resourced Machine Translation via Morpho-syntactic Modeling: The Case of Dialectal Arabic

Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect

Low Resource Arabic Dialects Transformer Neural Machine Translation Improvement through Incremental Transfer of Shared Linguistic Features

Content-Localization based Neural Machine Translation for Informal Dialectal Arabic: Spanish/French to Levantine/Gulf Arabic

The interplay between language similarity and script on a novel multi-layer Algerian dialect corpus

Parameter and Data Efficient Continual Pre-training for Robustness to Dialectal Variance in Arabic

Cross-Lingual Transfer from Related Languages: Treating Low-Resource Maltese as Multilingual Code-Switching

Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR

A multilingual training strategy for low resource Text to Speech

Zero-Resource Multi-Dialectal Arabic Natural Language Understanding

Improving neural machine translation for low resource languages through non-parallel corpora: a case study of Egyptian dialect to modern standard Arabic translation

Arabic dialect identification in social media: A hybrid model with transformer models and BiLSTM

Bilingual Adaptation of Monolingual Foundation Models

Exploiting Dialect Identification in Automatic Dialectal Text Normalization

A Benchmark Evaluation of Multilingual Large Language Models for Arabic Cross-Lingual Named-Entity Recognition

Language-agnostic Multilingual Modeling

Languages You Know Influence Those You Learn: Impact of Language Characteristics on Multi-Lingual Text-to-Text Transfer

OSN-MDAD: Machine Translation Dataset for Arabic Multi-Dialectal Conversations on Online Social Media

A Survey of Large Language Models for Arabic Language and its Dialects