Abstract: The popularity of multimodal large language models (MLLMs) has triggered a recent surge in research on evaluating these models. However, existing evaluation studies of MLLMs focus primarily on the comprehension and reasoning of unimodal (vision) content and neglect performance on multimodal (vision-language) content understanding. Beyond multimodal reasoning, multimodal content comprehension tasks require a deep understanding of multimodal contexts, in which the modalities must interact to produce a final answer. In this paper, we introduce MM-BigBench, a comprehensive assessment framework that incorporates a diverse range of metrics to evaluate the performance of various models and instructions across a wide spectrum of multimodal content comprehension tasks. Our work thus complements research on the performance of MLLMs in multimodal comprehension tasks, yielding a more comprehensive and holistic evaluation of MLLMs. First, we use the Best Performance metric to establish each model's performance upper bound on each dataset. Next, the Mean Relative Gain metric assesses the overall performance of the various models and instructions, while the Stability metric measures their sensitivity. Furthermore, previous research evaluates either models or instructions in isolation and neglects the adaptability between the two. We propose the Adaptability metric to quantify the adaptability between models and instructions. Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights. Our code will be released at https://github.com/declare-lab/MM-BigBench.
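
The abstract names its metrics without giving formulas, so the following is only a rough sketch of how three of them might be computed for one dataset. Everything in it is an assumption made for illustration: the accuracy table, the function names, and the formulas themselves (maximum over instructions for Best Performance, average percentage gain over the per-instruction mean across models for Mean Relative Gain, and population standard deviation across instructions for Stability). The paper's own definitions may differ, and the Adaptability metric is left out because the abstract gives no indication of how it is computed.

```python
from statistics import mean, pstdev

# Hypothetical accuracy table for a single dataset: acc[model][instruction].
# The numbers are made up for illustration and are not results from the paper.
acc = {
    "model_a": {"inst_1": 60.0, "inst_2": 80.0},
    "model_b": {"inst_1": 40.0, "inst_2": 20.0},
}


def best_performance(acc, model):
    """Assumed definition: highest accuracy the model reaches over all instructions."""
    return max(acc[model].values())


def mean_relative_gain(acc, model):
    """Assumed definition: average percentage gain of the model over the
    mean accuracy of all models, computed per instruction."""
    gains = []
    for inst, score in acc[model].items():
        baseline = mean([acc[m][inst] for m in acc])
        gains.append((score - baseline) / baseline * 100.0)
    return mean(gains)


def stability(acc, model):
    """Assumed definition: population standard deviation of the model's
    accuracy across instructions (lower means less sensitive)."""
    return pstdev(list(acc[model].values()))


if __name__ == "__main__":
    for name in acc:
        print(
            name,
            best_performance(acc, name),
            round(mean_relative_gain(acc, name), 2),
            round(stability(acc, name), 2),
        )
```

Under these assumed definitions, a model with a high Best Performance but a large Stability value would be one whose upper bound depends heavily on which instruction is used, which is the kind of model-instruction interaction the abstract says it aims to measure.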

MMR: Evaluating Reading Ability of Large Multimodal Models

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models

MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities

MULTI: Multimodal Understanding Leaderboard with Text and Images

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

MileBench: Benchmarking MLLMs in Long Context