Dallah: A Dialect-Aware Multimodal Large Language Model for Arabic

Fakhraddin Alwajih, Gagan Bhatia, Muhammad Abdul-Mageed
2024-07-26
Abstract: Recent advancements have significantly enhanced the capabilities of Multimodal Large Language Models (MLLMs) in generating and understanding image-to-text content. Despite these successes, progress is predominantly limited to English due to the scarcity of high-quality multimodal resources in other languages. This limitation impedes the development of competitive models in languages such as Arabic. To alleviate this situation, we introduce an efficient Arabic multimodal assistant, dubbed Dallah, that utilizes an advanced language model based on LLaMA-2 to facilitate multimodal interactions. Dallah demonstrates state-of-the-art performance in Arabic MLLMs. Through fine-tuning on six Arabic dialects, Dallah showcases its capability to handle complex dialectal interactions incorporating both textual and visual elements. The model excels in two benchmark tests: one evaluating its performance on Modern Standard Arabic (MSA) and another specifically designed to assess dialectal responses. Beyond its robust performance in multimodal interaction tasks, Dallah has the potential to pave the way for further development of dialect-aware Arabic MLLMs.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the challenges that Arabic large language models (LLMs) face with multimodal data and dialectal variation. Specifically, it targets three issues:

1. **Scarcity of Multimodal Resources**: Development of multimodal large language models has focused primarily on English. High-quality multimodal resources are comparatively scarce in other languages, which limits progress for non-English languages such as Arabic.
2. **Arabic Dialect Processing**: Arabic has a rich variety of dialects that differ significantly from one another. Existing natural language processing (NLP) models are usually designed for Modern Standard Arabic (MSA) and handle dialectal variation poorly.
3. **Multimodal Interaction**: The paper also examines how to enable models to handle complex dialectal interactions that combine text and visual elements, which matters both for the user experience and for preserving linguistic diversity.

To address these issues, the paper proposes a multimodal Arabic assistant model named "Dallah." The model is built on LLaMA-2 and relies on the following methods:

- **Data Translation and Filtering**: The paper proposes a translation and filtering method that converts English-centric image-text pair datasets into Arabic while preserving data quality. The method translates the data with the Google Translate API and then uses a sentence embedding model to score post-translation similarity, so that low-quality translations can be filtered out.
- **Dialect Dataset Construction**: To cover the diversity of Arabic dialects, the paper randomly selects data subsets for the dialects of six countries (Egypt, Mauritania, Morocco, Palestine, Saudi Arabia, and Yemen) and has professional translators render them from Modern Standard Arabic into the respective dialects.
- **Model Architecture and Training**: Dallah combines a visual encoder, a projection layer, and a language model based on AraLLaMA. Training proceeds in three stages (pre-training, instruction fine-tuning, and dialect instruction fine-tuning) that progressively build the model's multimodal understanding and dialect processing capabilities.

With these methods, Dallah performs well on multimodal tasks in Modern Standard Arabic and also handles six major Arabic dialects effectively. The paper additionally introduces Dallah-Bench, a benchmark designed to evaluate the model's comprehension of Arabic dialects in practical applications.
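
The translation-and-filtering step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `embed` callable, and the default threshold of 0.8 are all hypothetical, and the summary does not say which texts the similarity is computed between. The sketch assumes that `embed` maps both the English source and its translation into a shared vector space (for example, via a multilingual sentence embedding model), and it keeps only pairs whose cosine similarity reaches the threshold.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_translations(
    pairs: Iterable[Tuple[str, str]],
    embed: Callable[[str], Vector],
    threshold: float = 0.8,  # hypothetical cutoff, not taken from the paper
) -> List[Tuple[str, str, float]]:
    """Keep (source, translation, score) triples whose similarity meets the threshold.

    `embed` is a placeholder for any sentence embedding function that places
    the source and the translation in the same vector space (an assumption).
    """
    kept = []
    for source, translation in pairs:
        score = cosine_similarity(embed(source), embed(translation))
        if score >= threshold:
            kept.append((source, translation, score))
    return kept
```

In practice, `embed` would wrap a real sentence embedding model and `pairs` would hold each English caption or instruction together with its machine translation. The threshold trades dataset size against translation quality and would need to be tuned on inspected samples.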