MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, Rongrong Ji

2024-03-17

Abstract: Multimodal Large Language Models (MLLMs) rely on a powerful LLM to perform multimodal tasks and have shown striking emergent abilities in recent studies, such as writing poems based on an image. However, such case studies cannot fully reflect MLLM performance, and a comprehensive evaluation is still lacking. In this paper, we fill this gap by presenting MME, the first comprehensive MLLM evaluation benchmark. It measures both perception and cognition abilities across a total of 14 subtasks. To avoid the data leakage that may arise from using public datasets directly for evaluation, all instruction-answer pairs are manually designed. The concise instruction design lets us compare MLLMs fairly, without struggling with prompt engineering, and also makes quantitative statistics easy to compute. A total of 30 advanced MLLMs are comprehensively evaluated on MME; the results not only suggest that existing MLLMs still have considerable room for improvement but also reveal potential directions for subsequent model optimization. Instructions for requesting the data and the online leaderboards are available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the lack of a comprehensive evaluation benchmark for Multimodal Large Language Models (MLLMs). Although existing MLLMs show surprising capabilities on multimodal tasks, there has been no unified benchmark for measuring their performance. The paper proposes MME (Multimodal Large Language Model Evaluation benchmark), which evaluates MLLMs on both perceptual and cognitive abilities across 14 subtasks, with the aim of filling this gap and guiding the optimization of future models. MME uses concise, clear instructions to reduce the influence of prompt engineering on model outputs and to keep the evaluation results objective and accurate. Through zero-shot evaluation of 30 advanced MLLMs, the paper reveals problems in existing models with basic instruction following, basic perception and reasoning, and object hallucination, providing useful insights for subsequent research.
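This summary does not spell out how the concise instructions translate into quantitative statistics. The sketch below shows one way such scoring could work. It assumes that every question expects a "yes" or "no" answer and that each image carries more than one question, so that a plain per-question accuracy and a stricter per-image accuracy (every question on an image answered correctly) can both be computed. The function names `parse_yes_no` and `score_subtask` and the record layout are hypothetical, chosen only for illustration; this is not the authors' evaluation script.

```python
import re
from collections import defaultdict


def parse_yes_no(response):
    """Map a free-form model response to 'yes', 'no', or None.

    Only the first word is inspected, so "not sure" is treated as
    unparseable rather than as "no".
    """
    match = re.match(r"[a-z]+", response.strip().lower())
    if match and match.group(0) in ("yes", "no"):
        return match.group(0)
    return None


def score_subtask(records):
    """Score one subtask from (image_id, ground_truth, response) records.

    Returns (accuracy, accuracy_plus) as percentages:
    accuracy      counts each question separately;
    accuracy_plus counts an image only if all of its questions are correct.
    """
    per_image = defaultdict(list)
    for image_id, truth, response in records:
        per_image[image_id].append(parse_yes_no(response) == truth)

    total_questions = sum(len(flags) for flags in per_image.values())
    if total_questions == 0:
        raise ValueError("no records to score")

    correct = sum(sum(flags) for flags in per_image.values())
    accuracy = 100.0 * correct / total_questions
    accuracy_plus = 100.0 * sum(all(f) for f in per_image.values()) / len(per_image)
    return accuracy, accuracy_plus


# Hypothetical records: two images, two questions each.
example_records = [
    ("img1", "yes", "Yes."),
    ("img1", "no", "No, there is not."),
    ("img2", "yes", "Yes"),
    ("img2", "no", "Yes, it is."),
]
example_scores = score_subtask(example_records)
```

Restricting answers to yes/no keeps parsing trivial, and a response that is neither counts as wrong, which also penalizes failures of basic instruction following.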

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

A Survey on Benchmarks of Multimodal Large Language Models

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

A Survey on Evaluation of Multimodal Large Language Models

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Evaluating and Advancing Multimodal Large Language Models in Ability Lens

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

Surveying the MLLM Landscape: A Meta-Review of Current Surveys

P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks

CMMLU: Measuring massive multitask language understanding in Chinese