CAMEL-Bench: A Comprehensive Arabic LMM Benchmark

Sara Ghaboura, Ahmed Heakl, Omkar Thawakar, Ali Alharthi, Ines Riahi, Abduljalil Saif, Jorma Laaksonen, Fahad S. Khan, Salman Khan, Rao M. Anwer
2024-10-25
Abstract: Recent years have witnessed a significant interest in developing large multimodal models (LMMs) capable of performing various visual reasoning and understanding tasks. This has led to the introduction of multiple LMM benchmarks to evaluate LMMs on different tasks. However, most existing LMM evaluation benchmarks are predominantly English-centric. In this work, we develop a comprehensive LMM evaluation benchmark for the Arabic language to represent a large population of over 400 million speakers. The proposed benchmark, named CAMEL-Bench, comprises eight diverse domains and 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant diseases, and remote sensing-based land use understanding, to evaluate broad scenario generalizability. Our CAMEL-Bench comprises around 29,036 questions that are filtered from a larger pool of samples, where the quality is manually verified by native speakers to ensure reliable model assessment. We conduct evaluations of both closed-source, including GPT-4 series, and open-source LMMs. Our analysis reveals the need for substantial improvement, especially among the best open-source models, with even the closed-source GPT-4o achieving an overall score of 62%. Our benchmark and evaluation scripts are open-sourced.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Computers and Society, Machine Learning
What problem does this paper attempt to address?
This paper addresses the following problem: **existing evaluation benchmarks for large multimodal models (LMMs) are mainly English-centric, and no comprehensive, diverse Arabic LMM evaluation benchmark exists**. This gap restricts the application and improvement of these models in Arabic-speaking settings.

Specifically, the paper points out that although multimodal models have made significant progress in recent years on tasks such as visual reasoning, perception, and understanding, and multiple LMM evaluation benchmarks have been introduced, most existing benchmarks focus on English. Given that Arabic is the fifth most widely used language in the world, with more than 400 million speakers, an LMM evaluation benchmark specifically for Arabic is needed to support the development and improvement of Arabic LMMs.

To address this, the paper proposes **CAMEL-Bench**, which is the first comprehensive Arabic LMM evaluation benchmark. CAMEL-Bench covers eight domains and 38 sub-domains, including multi-image understanding, complex visual perception, handwritten document understanding, video understanding, medical imaging, plant disease identification, and remote-sensing-based land-use understanding. The benchmark contains approximately 29,036 questions, which were filtered from a larger pool of samples and manually verified by native speakers to ensure data quality and reliability.

By constructing CAMEL-Bench, the researchers hope to:

1. **Fill the gap in Arabic LMM evaluation benchmarks**: provide a comprehensive and diverse benchmark covering a wide range of multimodal tasks.
2. **Promote the development of Arabic LMMs**: reveal the deficiencies of existing models through rigorous evaluation and encourage the development of more advanced models.
3. **Improve the performance of models in Arabic-speaking settings**: especially when handling complex multimodal data, in tasks such as OCR, chart understanding, and video analysis.

In conclusion, CAMEL-Bench aims to provide a reliable evaluation tool for the research and development of Arabic LMMs, thereby supporting further progress in this field.