Abstract: Large vision-language models (VLMs) have recently achieved remarkable progress, exhibiting impressive multimodal perception and reasoning abilities. However, effectively evaluating these large VLMs remains a major challenge, hindering future development in this domain. Traditional benchmarks like VQAv2 or COCO Caption provide quantitative performance measurements but lack fine-grained ability assessment and robust evaluation metrics. Meanwhile, subjective benchmarks such as OwlEval offer comprehensive evaluations of a model's abilities, but they rely on human labor, which is not scalable and may introduce significant bias. In response to these challenges, we propose MMBench, a bilingual benchmark for assessing the multi-modal capabilities of VLMs. MMBench methodically develops a comprehensive evaluation pipeline with the following key features: 1. MMBench is meticulously curated with well-designed quality control schemes, surpassing existing similar benchmarks in the number and variety of evaluation questions and abilities; 2. MMBench introduces a rigorous CircularEval strategy and incorporates large language models to convert free-form predictions into pre-defined choices, which helps yield accurate evaluation results for models with limited instruction-following capabilities; 3. MMBench incorporates multiple-choice questions in both English and Chinese versions, enabling an apples-to-apples comparison of VLMs' performance in a bilingual context. To summarize, MMBench is a systematically designed objective benchmark for robust and holistic evaluation of vision-language models. We hope MMBench will assist the research community in better evaluating their models and facilitate future progress in this area. The evaluation code of MMBench has been integrated into VLMEvalKit: https://github.com/open-compass/VLMEvalKit.
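The abstract names CircularEval without spelling it out. As described in the MMBench paper, each multiple-choice question is posed once per choice with the options circularly shifted, and the model gets credit only if it answers correctly on every pass. The following is a minimal sketch of that idea, not the VLMEvalKit implementation; the function names `circular_shifts` and `circular_eval` and the `predict` callback signature are assumptions made for illustration.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical predictor signature: takes a question and an ordered list of
# choices, returns the index of the choice it selects.
Predictor = Callable[[str, List[str]], int]


def circular_shifts(
    choices: Sequence[str], answer_idx: int
) -> Iterator[Tuple[List[str], int]]:
    """Yield every circular rotation of the choices with the shifted answer index."""
    items = list(choices)
    n = len(items)
    for k in range(n):
        shifted = items[k:] + items[:k]
        # The item originally at answer_idx now sits k positions earlier (mod n).
        yield shifted, (answer_idx - k) % n


def circular_eval(
    predict: Predictor, question: str, choices: Sequence[str], answer_idx: int
) -> bool:
    """Return True only if the predictor is correct on all circular passes."""
    return all(
        predict(question, shifted) == new_idx
        for shifted, new_idx in circular_shifts(choices, answer_idx)
    )


# Two toy predictors, for illustration only.
def always_first(question: str, choices: List[str]) -> int:
    # Position-biased model: always picks the first option.
    return 0


def oracle(question: str, choices: List[str]) -> int:
    # Content-aware model: finds the right option wherever it appears.
    return choices.index("cat")


if __name__ == "__main__":
    q = "Which animal is shown?"
    opts = ["cat", "dog", "bird"]
    print(circular_eval(always_first, q, opts, 0))
    print(circular_eval(oracle, q, opts, 0))
```

A predictor that always picks the first option would pass a single-shot evaluation whenever the answer happens to be listed first, but it fails here because the answer moves on later passes. That is the kind of position bias the strategy is meant to expose.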

Benchmarking Large Multimodal Models against Common Corruptions

R-Bench: Are your Large Multimodal Model Robust to Real-world Corruptions?

A Survey on Multimodal Benchmarks: In the Era of Large AI Models

A Survey on Benchmarks of Multimodal Large Language Models

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study

MMR: Evaluating Reading Ability of Large Multimodal Models

MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective

MileBench: Benchmarking MLLMs in Long Context

Investigating Data Contamination in Modern Benchmarks for Large Language Models

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

MMBench: Is Your Multi-modal Model an All-around Player?

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench