Abstract: Recent years have witnessed significant interest in developing large multimodal models (LMMs) capable of performing various visual reasoning and understanding tasks. This has led to the introduction of multiple LMM benchmarks to evaluate LMMs on different tasks. However, most existing LMM evaluation benchmarks are predominantly English-centric. In this work, we develop a comprehensive LMM evaluation benchmark for the Arabic language to represent a large population of over 400 million speakers. The proposed benchmark, named CAMEL-Bench, comprises eight diverse domains and 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, to evaluate broad scenario generalizability. Our CAMEL-Bench comprises around 29,036 questions that are filtered from a larger pool of samples, where the quality is manually verified by native speakers to ensure reliable model assessment. We conduct evaluations of both closed-source LMMs, including the GPT-4 series, and open-source LMMs. Our analysis reveals the need for substantial improvement, especially among the best open-source models, with even the closed-source GPT-4o achieving an overall score of 62%. Our benchmark and evaluation scripts are open-sourced.

What problem does this paper attempt to address?

The problem this paper addresses is that **existing evaluation benchmarks for large multimodal models (LMMs) are mainly English-centric, and a comprehensive, diverse Arabic LMM evaluation benchmark is lacking**. This restricts the application and improvement of these models in Arabic-speaking settings. Specifically, the paper points out that although multimodal models have made significant progress in recent years on tasks such as visual reasoning, perception, and understanding, and multiple LMM evaluation benchmarks have been introduced, most existing benchmarks focus on English. Given that Arabic is the fifth most widely used language in the world, with more than 400 million speakers, there is an urgent need for an LMM evaluation benchmark specifically for Arabic to promote the development and improvement of Arabic LMMs.

To solve this problem, the paper proposes **CAMEL-Bench**, which it presents as the first comprehensive Arabic LMM evaluation benchmark. CAMEL-Bench covers eight domains and 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant disease identification, and remote sensing-based land use understanding. The benchmark contains approximately 29,036 questions, filtered from a larger pool of samples and manually verified by native speakers to ensure data quality and reliability.

By constructing CAMEL-Bench, the researchers hope to:

1. **Fill the gap in Arabic LMM evaluation benchmarks**: Provide a comprehensive and diverse benchmark covering a wide range of multimodal tasks.
2. **Promote the development of Arabic LMMs**: Reveal the deficiencies of existing models through rigorous evaluation and encourage the development of more advanced models.
3. **Improve the performance of models in Arabic-speaking settings**: Especially in handling complex multimodal data, such as OCR, chart understanding, and video analysis.

In conclusion, CAMEL-Bench aims to provide a reliable evaluation tool for the research and development of Arabic LMMs, thereby promoting further progress in this field.

CAMEL-Bench: A Comprehensive Arabic LMM Benchmark

ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic

A bilingual benchmark for evaluating large language models

CamelEval: Advancing Culturally Aligned Arabic Language Models and Benchmarks

ArabLegalEval: A Multitask Benchmark for Assessing Arabic Legal Knowledge in Large Language Models

Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks

AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

A Benchmark Evaluation of Multilingual Large Language Models for Arabic Cross-Lingual Named-Entity Recognition

MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish

MMBench: Is Your Multi-modal Model an All-around Player?

BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic

MM-Eval: A Hierarchical Benchmark for Modern Mongolian Evaluation in LLMs

MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

MileBench: Benchmarking MLLMs in Long Context

MILU: A Multi-task Indic Language Understanding Benchmark

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria