Abstract: Recent advancements have significantly enhanced the capabilities of Multimodal Large Language Models (MLLMs) in generating and understanding image-to-text content. Despite these successes, progress is predominantly limited to English due to the scarcity of high-quality multimodal resources in other languages. This limitation impedes the development of competitive models in languages such as Arabic. To alleviate this situation, we introduce an efficient Arabic multimodal assistant, dubbed Dallah, that utilizes an advanced language model based on LLaMA-2 to facilitate multimodal interactions. Dallah demonstrates state-of-the-art performance in Arabic MLLMs. Through fine-tuning on six Arabic dialects, Dallah showcases its capability to handle complex dialectal interactions incorporating both textual and visual elements. The model excels in two benchmark tests: one evaluating its performance on Modern Standard Arabic (MSA) and another specifically designed to assess dialectal responses. Beyond its robust performance in multimodal interaction tasks, Dallah has the potential to pave the way for further development of dialect-aware Arabic MLLMs.

What problem does this paper attempt to address?

The paper aims to address the challenges that Arabic large language models (LLMs) face when dealing with multimodal data and dialectal variation. Specifically, it targets three issues:

1. **Scarcity of Multimodal Resources**: Current multimodal large language models are developed primarily for English. High-quality multimodal resources are scarce in other languages, which limits the development of these models for non-English languages such as Arabic.
2. **Arabic Dialect Processing**: Arabic has a rich variety of dialects, with significant differences between them. Existing natural language processing (NLP) models are usually designed for Modern Standard Arabic (MSA) and handle dialectal variation poorly.
3. **Multimodal Interaction**: The paper also focuses on enabling models to handle complex dialectal interactions that combine text and visual elements, which is crucial for improving the user interaction experience and preserving linguistic diversity.

To address these issues, the paper proposes a multimodal Arabic assistant model named "Dallah." The model is built on LLaMA-2 and pursues its goals through the following methods:

- **Data Translation and Filtering**: The paper proposes a translation and filtering method that converts English-centric image-text pair datasets into Arabic while preserving data quality. The method uses the Google Translate API for translation and a sentence embedding model for post-translation similarity assessment to check translation quality.
- **Dialect Dataset Construction**: To cover the diversity of Arabic dialects, the paper randomly selects data subsets for six countries (Egypt, Mauritania, Morocco, Palestine, Saudi Arabia, and Yemen) and has professional translators translate them from Modern Standard Arabic into the respective dialects.
- **Model Architecture and Training**: Dallah combines a visual encoder, a projection layer, and a language model based on AraLLaMA. Training is divided into three stages (pre-training, instruction fine-tuning, and dialect instruction fine-tuning) that gradually build the model's capabilities in multimodal understanding and dialect processing.

Through these methods, Dallah performs strongly on multimodal tasks in Modern Standard Arabic and also handles six major Arabic dialects well. Additionally, the paper introduces a benchmark, Dallah-Bench, designed specifically to evaluate the model's Arabic dialect comprehension in practical applications.
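
The similarity-based filtering step mentioned under Data Translation and Filtering can be illustrated with a minimal sketch. The summary does not say which texts are compared or what threshold is used, so everything here is an assumption: `toy_embed` is a hypothetical character-trigram stand-in for a real sentence embedding model, `filter_pairs` and its `threshold` default are illustrative names and values, and each pair is simply a reference text and a candidate text whose embeddings are compared by cosine similarity.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for a sentence embedding model.

    Counts character trigrams; a real pipeline would call an actual
    embedding model here instead.
    """
    padded = f"  {text.lower()}  "
    counts = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    return {gram: float(n) for gram, n in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(
    pairs: List[Tuple[str, str]],
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> List[Tuple[str, str, float]]:
    """Keep (reference, candidate) pairs whose similarity meets the threshold."""
    kept = []
    for reference, candidate in pairs:
        score = cosine(embed(reference), embed(candidate))
        if score >= threshold:
            kept.append((reference, candidate, score))
    return kept
```

Passing a different `embed` callable is the intended extension point: with a multilingual embedding model, the reference could be the English source and the candidate its Arabic translation, while the toy embedder only gives meaningful scores for texts in the same language.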

Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic

A Survey of Large Language Models for Arabic Language and its Dialects

ALLaM: Large Language Models for Arabic and English

Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks

ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic

AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs

CamelEval: Advancing Culturally Aligned Arabic Language Models and Benchmarks

AL-QASIDA: Analyzing LLM Quality and Accuracy Systematically in Dialectal Arabic

Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic

AlcLaM: Arabic Dialectal Language Model

A Benchmark Evaluation of Multilingual Large Language Models for Arabic Cross-Lingual Named-Entity Recognition

Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect

A bilingual benchmark for evaluating large language models

Heterogeneous Ensemble Deep Learning Model for Enhanced Arabic Sentiment Analysis

LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content

Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic

Zero-Resource Multi-Dialectal Arabic Natural Language Understanding

Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR

CAMEL-Bench: A Comprehensive Arabic LMM Benchmark

GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning

Arabic Automatic Story Generation with Large Language Models