Abstract: Multimodal large language models (MLLMs) have broadened the scope of AI applications. Existing automatic evaluation methodologies for MLLMs are mainly limited to evaluating queries without considering user experience, and they inadequately address the nuances of creative and associative multimodal tasks. However, the open-ended and subjective nature of such tasks poses a significant challenge to evaluation, because it is difficult to define ground-truth answers for them. To this end, we propose a new evaluation paradigm for MLLMs: evaluating MLLMs with per-sample criteria, using a potent MLLM as the judge. To validate the feasibility and effectiveness of this paradigm, we design a benchmark, dubbed MLLM-Bench, by curating evaluation samples across six comprehensive cognitive levels. We benchmark 21 popular MLLMs in a pairwise-comparison fashion, showing diverse performance across models. Moreover, the validity of our benchmark is shown by its 88.02% agreement with human evaluation. We contend that the proposed paradigm explores the potential of MLLMs as effective evaluation tools with the help of per-sample criteria. See the online leaderboard at https://mllm-bench.llmzoo.com.

What problem does this paper attempt to address?

The paper addresses a gap in how multimodal large language models (MLLMs) are evaluated: existing methods focus mainly on objective queries and do not fully account for real-world user experience, especially the nuances of creative and associative multimodal tasks. Because these tasks are open-ended and subjective, it is very difficult to define "correct answers" for them, which poses a significant challenge to evaluation. Specifically, the paper points out that current evaluation frameworks concentrate on closed-ended questions with clear correct answers. Such tasks help quantify model performance, but they ignore user experience and do not cover the full range of human cognitive tasks that modern MLLMs are designed to perform. In areas such as creativity, association, and ethical judgment in particular, tasks are hard to reduce to simple right-or-wrong answers.

To solve this problem, the authors propose a new evaluation paradigm: using a powerful MLLM as the evaluator, combined with criteria written specifically for each sample. This shifts evaluation from fixed answers to flexible, criteria-based judgment, which is especially suitable for open-ended tasks. The paradigm accepts that multiple answers can be valid and rates an answer by how well it satisfies the criteria, rather than by whether it matches a single "correct" answer.

In addition, the authors developed a benchmark suite named MLLM-Bench, which covers 42 aspects of MLLM capability distributed across six ability levels: perception, understanding, application, analysis, evaluation, and creation. These six levels are inspired by Bloom's Taxonomy. What distinguishes this benchmark suite is that it may closely match the questions users pose in real application scenarios, which often have no static standard answer.
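
The criteria-based pairwise judging described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Sample` fields, the prompt wording, the A/B/tie verdict format, and the `judge` callable (a stand-in for whatever MLLM is used as the judge) are all hypothetical. The agreement function shows one plausible way a judge-versus-human agreement percentage could be computed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One evaluation item: an image, a question, and criteria for that question."""
    image_path: str
    question: str
    criteria: str  # per-sample criteria (hypothetical field name)


def build_judge_prompt(sample: Sample, answer_a: str, answer_b: str) -> str:
    """Assemble a pairwise-comparison prompt. The wording is an assumption."""
    return (
        f"Question: {sample.question}\n"
        f"Evaluation criteria: {sample.criteria}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Which answer better satisfies the criteria? Reply with A, B, or tie."
    )


def parse_verdict(reply: str) -> str:
    """Map a free-form judge reply to 'A', 'B', or 'tie' using its first word."""
    words = reply.strip().split()
    first = words[0].strip(".,:;!").upper() if words else ""
    return first if first in ("A", "B") else "tie"


def judge_pair(
    sample: Sample,
    answer_a: str,
    answer_b: str,
    judge: Callable[[str, str], str],
) -> str:
    """Ask the judge (image path + prompt -> text reply) to compare two answers."""
    prompt = build_judge_prompt(sample, answer_a, answer_b)
    return parse_verdict(judge(sample.image_path, prompt))


def agreement_rate(judge_verdicts: List[str], human_verdicts: List[str]) -> float:
    """Fraction of samples where the judge's verdict equals the human verdict."""
    if not judge_verdicts or len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)


def stub_judge(image_path: str, prompt: str) -> str:
    """Placeholder judge that always prefers answer A; replace with a real MLLM call."""
    return "A"


if __name__ == "__main__":
    sample = Sample(
        image_path="dog.png",
        question="Write a funny caption for this photo.",
        criteria="The caption should be humorous and refer to what the dog is doing.",
    )
    verdict = judge_pair(sample, "Caption one", "Caption two", stub_judge)
    print(verdict)
```

Passing the judge in as a callable keeps the sketch independent of any particular model API. In practice one might also query each pair twice with the answer order swapped, since judge models can be sensitive to position; that step is omitted here.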

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

A Survey on Benchmarks of Multimodal Large Language Models

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

Surveying the MLLM Landscape: A Meta-Review of Current Surveys

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Understanding the Role of LLMs in Multimodal Evaluation Benchmarks

LIME: Less Is More for MLLM Evaluation

A Survey on Evaluation of Multimodal Large Language Models

A Survey on Multimodal Benchmarks: In the Era of Large AI Models

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

MMBench: Is Your Multi-modal Model an All-around Player?

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

MM-R$^3$: On (In-)Consistency of Multi-modal Large Language Models (MLLMs)

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

CMMLU: Measuring massive multitask language understanding in Chinese