MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

Wentao Ge, Shunian Chen, Guiming Hardy Chen, Junying Chen, Zhihong Chen, Nuo Chen, Wenya Xie, Shuo Yan, Chenghao Zhu, Ziyue Lin, Song Dingjie, Xidong Wang, Anningzhe Gao, Zhang Zhiyi, Jianquan Li, Xiang Wan, Benyou Wang
2024-09-15
Abstract: Multimodal large language models (MLLMs) have broadened the scope of AI applications. Existing automatic evaluation methodologies for MLLMs are mainly limited in evaluating queries without considering user experiences, inadequately addressing the nuances of creative and associative multimodal tasks. However, the open-ended and subjective nature of such tasks poses a significant challenge to the evaluation methodology, where it is difficult to define the ground-truth answers for them. To this end, in our paper, we propose a new evaluation paradigm for MLLMs, which is evaluating MLLMs with per-sample criteria using potent MLLM as the judge. To validate the feasibility and effectiveness of this paradigm, we design a benchmark, dubbed MLLM-Bench, by curating the evaluation samples across six comprehensive cognitive levels. We benchmark 21 popular MLLMs in a pairwise-comparison fashion, showing diverse performance across models. Moreover, the validity of our benchmark manifests itself in reaching 88.02% agreement with human evaluation. We contend that the proposed paradigm explores the potential of MLLMs as effective evaluation tools with the help of per-sample criteria. See online leaderboard at https://mllm-bench.llmzoo.com.
Computation and Language
What problem does this paper attempt to address?
This paper addresses a gap in existing evaluation methods for multimodal large language models (MLLMs): they focus mainly on objective queries and do not fully account for real-world user experience, especially the nuances of creative and associative multimodal tasks. Because these tasks are open-ended and subjective, it is difficult to define a "correct answer" for them, which poses a significant challenge for evaluation. Specifically, the paper points out that current evaluation frameworks concentrate on closed-ended questions with clear correct answers. Such tasks help quantify model performance, but they neglect user experience and do not cover the full range of human cognitive tasks that modern MLLMs are designed to perform. In areas such as creativity, association, and ethical judgment in particular, tasks are hard to reduce to simple right-or-wrong answers.
To address this, the authors propose a new evaluation paradigm: using a powerful MLLM as the judge and evaluating each sample against criteria written specifically for it. This shifts evaluation from traditional fixed-answer matching to flexible, criteria-based assessment, which is especially suited to open-ended tasks. The paradigm accepts multiple valid answers and rates an answer by how well it meets the criteria rather than by whether it matches a single "correct" answer.
The authors also develop a benchmark suite named MLLM-Bench, which covers 42 aspects of MLLM capability distributed across six levels: perception, understanding, application, analysis, evaluation, and creation. These six levels are inspired by Bloom's Taxonomy.
What distinguishes this benchmark suite is that it is intended to align closely with the questions without static reference answers that users pose in real application scenarios.
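As a rough illustration of the paradigm described above (per-sample criteria, an MLLM judge, pairwise comparison, and agreement with human verdicts), the following Python sketch may help. It is not the authors' implementation: the names `Sample`, `build_judge_prompt`, `compare`, `tally`, and `agreement`, the prompt wording, and the "A"/"B"/"tie" verdict labels are all assumptions made for this example, and the judge is a stand-in callable rather than a real model call.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence

VERDICTS = {"A", "B", "tie"}  # assumed label set, not taken from the paper


@dataclass(frozen=True)
class Sample:
    """One benchmark item (hypothetical schema for illustration)."""
    question: str
    criteria: str  # criteria written specifically for this sample
    level: str     # e.g. "perception", "creation"


def build_judge_prompt(sample: Sample, answer_a: str, answer_b: str) -> str:
    """Assemble the text part of a pairwise judging request.

    A real MLLM judge would also receive the image; that is omitted here.
    """
    return (
        f"Question: {sample.question}\n"
        f"Evaluation criteria for this sample: {sample.criteria}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Which answer better satisfies the criteria? Reply with A, B, or tie."
    )


def compare(sample: Sample, answer_a: str, answer_b: str,
            judge: Callable[[str], str]) -> str:
    """Ask the judge for a verdict and validate it."""
    verdict = judge(build_judge_prompt(sample, answer_a, answer_b)).strip()
    if verdict not in VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return verdict


def tally(verdicts: Sequence[str]) -> Counter:
    """Count wins for A, wins for B, and ties."""
    return Counter(verdicts)


def agreement(judge_verdicts: Sequence[str],
              human_verdicts: Sequence[str]) -> float:
    """Fraction of samples where the judge and the human verdict coincide."""
    if len(judge_verdicts) != len(human_verdicts) or not judge_verdicts:
        raise ValueError("verdict lists must be non-empty and equally long")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)


if __name__ == "__main__":
    sample = Sample(
        question="Write a short caption for the image.",
        criteria="The caption should be witty and grounded in the image.",
        level="creation",
    )
    # Placeholder judge that always prefers answer A; a real setup would
    # call an MLLM here.
    print(compare(sample, "caption one", "caption two", lambda prompt: "A"))
```

Because the criteria travel with each sample, the same judging function can score open-ended questions that have no single reference answer. The agreement helper corresponds only loosely to the human-agreement figure reported in the abstract, since the paper's exact agreement protocol is not described in this summary.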