Abstract: We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges to evaluation benchmark development. Problems include: (1) How to systematically structure and evaluate the complicated multimodal tasks; (2) How to design evaluation metrics that work well across question and answer types; and (3) How to give model insights beyond a simple performance ranking. To this end, we present MM-Vet, designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model being able to integrate different core vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination. For evaluation metrics, we propose an LLM-based evaluator for open-ended outputs. The evaluator enables the evaluation across different question types and answer styles, resulting in a unified scoring metric. We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and models. Code and data are available at https://github.com/yuweihao/MM-Vet.

What problem does this paper attempt to address?

This paper addresses how to systematically evaluate large multimodal models (LMMs) on complex multimodal tasks. Specifically, it focuses on three questions:

1. **How to systematically construct and evaluate complex multimodal tasks**: Existing multimodal benchmarks usually focus on only one or two specific abilities, such as recognition, language generation, or OCR, and lack a comprehensive evaluation of more complex tasks.
2. **How to design evaluation metrics applicable to different types of questions and answers**: Different multimodal tasks may require different output formats. For example, a math problem can be answered with one word, while an essay question requires an answer of several hundred words. Designing a single metric that evaluates these different tasks uniformly is therefore a challenge.
3. **How to provide model insights beyond simple performance rankings**: In addition to an overall ranking, an in-depth analysis of a model's performance on different tasks is needed to better understand its strengths and weaknesses.

To address these problems, the paper proposes a new evaluation benchmark, **MM-Vet**. MM-Vet defines six core vision-language (VL) capabilities and examines 16 integrations of interest derived from combinations of these capabilities. The six core capabilities are:

- **Recognition**: General visual recognition ability, including the recognition of scenes, objects, and their attributes.
- **Knowledge**: Covers various knowledge-related tasks, such as social common sense, visual common sense, and encyclopedic knowledge.
- **OCR**: Scene text understanding and reasoning ability.
- **Spatial Awareness**: Understanding the spatial relationships between objects and scene text areas.
- **Language Generation**: The ability to generate fluent and informative text output.
- **Math**: Arithmetic ability to solve mathematical equations or practical problems.

In addition, the paper proposes an evaluator based on large language models (LLMs) for scoring the outputs of open-ended questions. This evaluator can handle different answer styles and question types, thereby providing a unified scoring metric. In this way, MM-Vet can not only evaluate a model's performance on different tasks, but also provide a more detailed analysis of the model's capabilities, beyond a simple overall ranking.

In conclusion, this paper aims to systematically evaluate and analyze the performance of large multimodal models on complex multimodal tasks through the MM-Vet benchmark, thereby providing guidance and reference for the development of multimodal models.
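
The idea of reporting results per capability and per capability integration, rather than only as one overall number, can be illustrated with a small aggregation sketch. This is not the authors' code. It assumes each benchmark sample is tagged with the set of capabilities it requires and has already received a score in [0, 1] from some grader, such as an LLM-based evaluator. The short capability keys, the `aggregate_scores` function, and the sample data are hypothetical names introduced for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical short keys for the six core capabilities named in the text.
CAPABILITIES = {"rec", "know", "ocr", "spat", "gen", "math"}


def aggregate_scores(samples):
    """Average per-sample scores overall, per capability, and per integration.

    `samples` is a list of (capabilities, score) pairs. `capabilities` is a
    set of keys from CAPABILITIES; `score` is a float in [0, 1] that is
    assumed to come from an external grader.
    """
    if not samples:
        raise ValueError("at least one sample is required")

    per_capability = defaultdict(list)
    per_integration = defaultdict(list)
    all_scores = []

    for caps, score in samples:
        caps = frozenset(caps)
        unknown = caps - CAPABILITIES
        if unknown:
            raise ValueError(f"unknown capabilities: {sorted(unknown)}")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")

        all_scores.append(score)
        # A sample counts once toward its exact capability combination ...
        per_integration[caps].append(score)
        # ... and once toward every individual capability it requires.
        for cap in caps:
            per_capability[cap].append(score)

    return {
        "total": mean(all_scores),
        "capability": {c: mean(v) for c, v in per_capability.items()},
        "integration": {
            "+".join(sorted(k)): mean(v) for k, v in per_integration.items()
        },
    }


if __name__ == "__main__":
    # Made-up scores, purely to show the shape of the report.
    demo = [({"ocr", "math"}, 1.0), ({"rec"}, 0.5), ({"rec", "ocr"}, 0.0)]
    report = aggregate_scores(demo)
    print(report["total"])
    print(report["capability"])
    print(report["integration"])
```

Grouping by both individual capability and exact combination is what lets a report of this kind separate, for example, a model that reads text well but fails when arithmetic is also required from a model that is weak at reading text in the first place.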

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities

MMBench: Is Your Multi-modal Model an All-around Player?

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

MM-InstructEval: Zero-Shot Evaluation of (Multimodal) Large Language Models on Multimodal Reasoning Tasks

Are We on the Right Way for Evaluating Large Vision-Language Models?

A Survey on Benchmarks of Multimodal Large Language Models

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

A Survey on Evaluation of Multimodal Large Language Models

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos