MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, Lijuan Wang
2023-10-24
Abstract: We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes. Rapid model advancements pose challenges to evaluation benchmark development. Problems include: (1) How to systematically structure and evaluate the complicated multimodal tasks; (2) How to design evaluation metrics that work well across question and answer types; and (3) How to give model insights beyond a simple performance ranking. To this end, we present MM-Vet, designed based on the insight that the intriguing ability to solve complicated tasks is often achieved by a generalist model being able to integrate different core vision-language (VL) capabilities. MM-Vet defines 6 core VL capabilities and examines the 16 integrations of interest derived from the capability combination. For evaluation metrics, we propose an LLM-based evaluator for open-ended outputs. The evaluator enables the evaluation across different question types and answer styles, resulting in a unified scoring metric. We evaluate representative LMMs on MM-Vet, providing insights into the capabilities of different LMM system paradigms and models. Code and data are available at https://github.com/yuweihao/MM-Vet.
Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to systematically evaluate large multimodal models (LMMs) on complex multimodal tasks. Specifically, it focuses on three questions:

1. **How to systematically construct and evaluate complex multimodal tasks**: Existing multimodal benchmarks usually focus on only one or two specific abilities, such as recognition, language generation, or OCR, and lack a comprehensive evaluation of more complex tasks.
2. **How to design evaluation metrics applicable to different types of questions and answers**: Different multimodal tasks may require different output formats. For example, a math problem can be answered with one word, while an essay question requires an answer of several hundred words. Designing a single metric that evaluates such different tasks uniformly is therefore a challenge.
3. **How to provide model insights beyond simple performance rankings**: In addition to an overall ranking, an in-depth analysis of a model's performance on different tasks is needed to better understand its strengths and weaknesses.

To address these problems, the paper proposes a new evaluation benchmark, **MM-Vet**. MM-Vet defines six core vision-language (VL) capabilities and examines 16 integrations of interest derived from combinations of these capabilities. The six core capabilities are:

- **Recognition**: General visual recognition ability, including the recognition of scenes, objects, and their attributes.
- **Knowledge**: Covers various knowledge-related tasks, such as social common sense, visual common sense, and encyclopedic knowledge.
- **OCR**: Scene text understanding and reasoning ability.
- **Spatial Awareness**: Understanding the spatial relationships between objects and scene text areas.
- **Language Generation**: The ability to generate fluent and informative text output.
- **Math**: Arithmetic ability to solve mathematical equations or practical problems.

In addition, the paper proposes an evaluator based on large language models (LLMs) for scoring open-ended outputs. This evaluator can handle different question types and answer styles, thereby providing a unified scoring metric. In this way, MM-Vet can not only evaluate a model's performance on different tasks but also provide a more detailed analysis of its capabilities, beyond a simple overall ranking.

In conclusion, this paper aims to systematically evaluate and analyze the performance of large multimodal models on complex multimodal tasks through the MM-Vet benchmark, thereby providing guidance and reference for the development of multimodal models.
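
The capability-level analysis described above can be illustrated with a small sketch. It is not the authors' evaluation code: the record layout, the capability labels, the score values, and the `aggregate` helper are all hypothetical, and the per-sample scores are simply assumed to lie in [0, 1] and to have been produced by some LLM-based grader. The sketch only shows how a single per-sample score might be rolled up into an overall number, per-capability averages, and per-integration averages.

```python
from collections import defaultdict

# Hypothetical per-sample records (illustrative data, not from the paper).
# Each record lists the capabilities a question requires and a score in [0, 1]
# that is assumed to come from an LLM-based grader.
samples = [
    {"capabilities": {"recognition", "knowledge"}, "score": 1.0},
    {"capabilities": {"ocr", "math"}, "score": 0.5},
    {"capabilities": {"ocr", "spatial", "math"}, "score": 0.0},
    {"capabilities": {"recognition"}, "score": 1.0},
]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def aggregate(records):
    """Roll per-sample scores up into total, per-capability and
    per-integration averages (hypothetical helper)."""
    per_capability = defaultdict(list)
    per_integration = defaultdict(list)
    for record in records:
        # A sample counts toward every capability it requires.
        for capability in record["capabilities"]:
            per_capability[capability].append(record["score"])
        # The integration is the exact combination of capabilities.
        key = "+".join(sorted(record["capabilities"]))
        per_integration[key].append(record["score"])
    return {
        "total": mean([r["score"] for r in records]),
        "per_capability": {k: mean(v) for k, v in per_capability.items()},
        "per_integration": {k: mean(v) for k, v in per_integration.items()},
    }


report = aggregate(samples)
print(report["total"])
print(report["per_capability"])
print(report["per_integration"])
```

Grouping by the exact capability combination, as well as by each individual capability, is what lets a breakdown of this kind show where a model is weak rather than only how it ranks overall.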