MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao

2024-04-25

Abstract: Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises $31,325$ meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering $32$ core meta-tasks and $162$ subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving $30$ LVLMs such as the proprietary GPT-4V, GeminiProVision, and open-sourced InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the narrow task coverage of existing multimodal evaluation benchmarks for large vision-language models (LVLMs). Those benchmarks focus mainly on basic multimodal capabilities, such as simple visual recognition and text-sparse OCR, and therefore cannot comprehensively assess LVLMs as candidates for multitask artificial general intelligence (AGI). To close this gap, the paper proposes MMT-Bench, a benchmark for evaluating LVLMs on multimodal multitask understanding. MMT-Bench contains 31,325 carefully curated multiple-choice visual questions covering 32 core meta-tasks and 162 subtasks. The questions span 13 image types (including natural scenes, synthetic images, depth maps, text-rich images, paintings, screenshots, point clouds, and medical images) and multimodal scenarios such as vehicle driving, GUI navigation, and embodied AI. This breadth lets MMT-Bench test visual recognition, localization, reasoning, OCR, counting, 3D perception, temporal understanding, and other capabilities, giving LVLM development a more comprehensive and rigorous evaluation standard.
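To make the multiple-choice, multitask setup concrete, the following is a minimal sketch of how accuracy might be aggregated per meta-task. It is not the paper's evaluation code: the record keys (`meta_task`, `answer`, `prediction`), the case-insensitive letter matching, and the choice of a question-level (micro) average for the overall score are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_meta_task(records):
    """Aggregate multiple-choice accuracy per meta-task.

    `records` is an iterable of dicts with the hypothetical keys
    'meta_task', 'answer', and 'prediction' (option letters such as 'A').
    Returns (per_task_accuracy, overall_accuracy).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        task = rec["meta_task"]
        total[task] += 1
        # Assumption: a prediction counts as correct when its option letter
        # matches the answer, ignoring case and surrounding whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[task] += 1
    per_task = {task: correct[task] / total[task] for task in total}
    # Assumption: the overall score is averaged over questions, not over tasks.
    n_questions = sum(total.values())
    overall = sum(correct.values()) / n_questions if n_questions else 0.0
    return per_task, overall


if __name__ == "__main__":
    toy_records = [
        {"meta_task": "ocr", "answer": "A", "prediction": "a"},
        {"meta_task": "ocr", "answer": "B", "prediction": "C"},
        {"meta_task": "counting", "answer": "D", "prediction": "D "},
    ]
    per_task, overall = accuracy_by_meta_task(toy_records)
    print(per_task, overall)
```

Reporting per-task scores alongside the overall score matters for a benchmark with this many tasks, since a single average can hide tasks on which a model performs near chance.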
