MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang

2024-06-29

Abstract: Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises $2,138$ question triplets, totaling $6,414$ distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by $31.73\%$, compared to an average gap of $8.03\%$ in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by $23.09\%$, whereas the gap for previous benchmarks is just $14.64\%$). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

This paper addresses systematic bias in the evaluation of Large Multimodal Models (LMMs). On existing multimodal benchmarks, Large Language Models (LLMs) achieve non-trivial performance even without any visual perception capabilities, which undermines the credibility of these evaluations. To address this problem, the paper proposes MMEvalPro, a benchmark that uses a trilogy evaluation pipeline and stricter metrics to avoid Type-I errors while keeping the efficiency of multiple-choice evaluation. For each question taken from an existing benchmark, human annotators add one perception question and one knowledge anchor question, so that a model must understand the image, the text, and the related knowledge to succeed. The paper conducts two experiments, the Seeing-or-Not Comparison and the Answer Consistency Test, which show that existing multimodal benchmarks are not fully trustworthy. They demonstrate that LLMs can score highly on certain benchmarks without processing any visual data, possibly because of data leakage, reliance on text information alone, or guessing. To make the evaluation more accurate, MMEvalPro counts a model as correct only when it answers the original question and the additional perception and knowledge questions correctly, and uses this accuracy as the main metric. The experimental results show that, compared with existing benchmarks, MMEvalPro is more challenging for LMMs and shows a larger performance gap between humans and models. The paper also analyzes the reasons for this gap and supports the trustworthiness of the MMEvalPro evaluation, providing a useful reference for future research.
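
To make the all-three-correct requirement concrete, here is a minimal Python sketch of how such a triplet-level score could be computed and compared with looser accuracies. It is an illustration under stated assumptions, not the authors' implementation: the `TripletResult` record, the function names, and the toy results are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TripletResult:
    """Hypothetical record: whether a model answered each question of one triplet correctly."""
    origin: bool      # original question from the existing benchmark
    perception: bool  # added perception question
    knowledge: bool   # added knowledge anchor question


def triplet_accuracy(results: List[TripletResult]) -> float:
    """Fraction of triplets in which all three questions are answered correctly."""
    if not results:
        return 0.0
    passed = sum(r.origin and r.perception and r.knowledge for r in results)
    return passed / len(results)


def origin_accuracy(results: List[TripletResult]) -> float:
    """Fraction of original questions answered correctly, ignoring the added questions."""
    if not results:
        return 0.0
    return sum(r.origin for r in results) / len(results)


def per_question_accuracy(results: List[TripletResult]) -> float:
    """Fraction of all individual questions answered correctly."""
    if not results:
        return 0.0
    correct = sum(r.origin + r.perception + r.knowledge for r in results)
    return correct / (3 * len(results))


# Toy data (made up for illustration): the model gets most original
# questions right but often fails the perception or knowledge question.
results = [
    TripletResult(origin=True, perception=True, knowledge=True),
    TripletResult(origin=True, perception=False, knowledge=True),
    TripletResult(origin=True, perception=False, knowledge=False),
    TripletResult(origin=False, perception=True, knowledge=True),
]

print(f"original-question accuracy: {origin_accuracy(results):.2f}")
print(f"per-question accuracy:      {per_question_accuracy(results):.2f}")
print(f"triplet-level accuracy:     {triplet_accuracy(results):.2f}")
```

On this toy data the original-question accuracy is 0.75 while the triplet-level accuracy is 0.25, which shows how a model that reaches correct answers without the supporting perception or knowledge is penalized by the stricter score.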

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

MMBench: Is Your Multi-modal Model an All-around Player?

A Survey on Benchmarks of Multimodal Large Language Models

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?

CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning

CMMMU: A Chinese Massive Multi-discipline Multimodal Understanding Benchmark

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

LIME: Less Is More for MLLM Evaluation

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark