MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Runjian Chen, Peng Xu, Renrui Zhang, Haozhe Zhang, Peng Gao, Yali Wang, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao
2024-04-25
Abstract: Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises $31,325$ meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering $32$ core meta-tasks and $162$ subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving $30$ LVLMs such as the proprietary GPT-4V, GeminiProVision, and open-sourced InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the narrow task coverage of existing multimodal evaluation benchmarks for large vision-language models (LVLMs). Those benchmarks focus mainly on basic multimodal capabilities, such as simple visual recognition and text-sparse OCR, and therefore cannot comprehensively assess LVLMs on the path toward multitask artificial general intelligence (AGI). To close this gap, the paper proposes MMT-Bench, a benchmark for evaluating LVLMs on multimodal multitask understanding. MMT-Bench contains 31,325 carefully curated multiple-choice visual questions covering 32 core meta-tasks and 162 subtasks. The questions span 13 image types (including natural scenes, synthetic images, depth maps, text-rich images, paintings, screenshots, point clouds, and medical images) and multiple multimodal scenarios (including vehicle driving, GUI navigation, and embodied AI). This breadth lets MMT-Bench test LVLM capabilities in visual recognition, localization, reasoning, OCR, counting, 3D perception, temporal understanding, and more, providing a more comprehensive and rigorous evaluation standard for LVLM development.
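
Because the benchmark consists of multiple-choice questions grouped into subtasks and meta-tasks, results can be broken down by group. The following is a minimal sketch of that breakdown, not the paper's evaluation code. The record fields (`meta_task`, `subtask`, `answer`, `prediction`), the toy task names, and the choice of a macro-average over subtasks are all assumptions made for illustration; the summary above does not specify the actual data format or aggregation rule.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: fraction of correct answers} for multiple-choice records.

    `key` names the (hypothetical) record field to group by,
    e.g. "subtask" or "meta_task".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[key]
        total[group] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy records with invented task names; real benchmark data will differ.
records = [
    {"meta_task": "visual_recognition", "subtask": "animal_recognition",
     "answer": "A", "prediction": "A"},
    {"meta_task": "visual_recognition", "subtask": "animal_recognition",
     "answer": "B", "prediction": "C"},
    {"meta_task": "ocr", "subtask": "scene_text",
     "answer": "D", "prediction": "d "},
    {"meta_task": "localization", "subtask": "object_detection",
     "answer": "C", "prediction": "C"},
]

per_subtask = accuracy_by_group(records, "subtask")
per_meta_task = accuracy_by_group(records, "meta_task")

# Assumed aggregation: macro-average, so every subtask counts equally
# regardless of how many questions it contains.
overall = sum(per_subtask.values()) / len(per_subtask)
```

A macro-average keeps large subtasks from dominating the headline number; a micro-average over all questions would be an equally reasonable choice under different assumptions.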