MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Haowei Liu,Xi Zhang,Haiyang Xu,Yaya Shi,Chaoya Jiang,Ming Yan,Ji Zhang,Fei Huang,Chunfeng Yuan,Bing Li,Weiming Hu

2024-10-08

Abstract: Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks. However, most existing MLLMs and benchmarks primarily focus on single-image input scenarios, leaving the performance of MLLMs when handling realistic multiple images underexplored. Although a few benchmarks consider multiple images, their evaluation dimensions and samples are very limited. In this paper, we propose a new benchmark, MIBench, to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS) and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples. During data construction, for MII and MKS, we extract correct options from manual annotations and create challenging distractors to obtain multiple-choice questions. For MIC, to enable an in-depth evaluation, we set four sub-tasks and transform the original datasets into in-context learning formats. We evaluate several open-source and closed-source MLLMs on the proposed MIBench. The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs, such as limited fine-grained perception, multi-image reasoning and in-context learning abilities. The annotated data of MIBench is available at https://huggingface.co/datasets/StarBottle/MIBench.
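To make the multiple-choice construction described in the abstract concrete, here is a minimal Python sketch of how one correct option and several distractors could be combined into a shuffled question and then scored. It is an illustration under stated assumptions, not the authors' pipeline: the names `MultipleChoiceItem`, `build_item`, and `accuracy` are hypothetical, and the actual MIBench data schema and distractor-generation method are not specified here.

```python
import random
from dataclasses import dataclass

LETTERS = "ABCDEFGH"  # assumption: at most eight options per question


@dataclass
class MultipleChoiceItem:
    question: str
    options: list[str]  # options in their shuffled order
    answer: str         # letter of the correct option, e.g. "B"


def build_item(question: str, correct: str, distractors: list[str],
               rng: random.Random) -> MultipleChoiceItem:
    """Combine the correct option with distractors and shuffle their order."""
    options = [correct, *distractors]
    if len(options) > len(LETTERS):
        raise ValueError("too many options for the available letters")
    rng.shuffle(options)  # avoid the answer always sitting in one position
    return MultipleChoiceItem(question, options, LETTERS[options.index(correct)])


def accuracy(items: list[MultipleChoiceItem], predictions: list[str]) -> float:
    """Fraction of predicted letters that match the stored answer letter."""
    if not items:
        return 0.0
    hits = sum(pred.strip().upper() == item.answer
               for item, pred in zip(items, predictions))
    return hits / len(items)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the shuffle is reproducible
    item = build_item("Which animal appears in both images?",
                      "cat", ["dog", "bird", "fish"], rng)
    print(item.options, item.answer)
    print(accuracy([item], [item.answer]))
```

Shuffling the options with a seeded generator keeps the position of the correct answer from leaking a positional bias while leaving the construction reproducible.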

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the insufficient evaluation of current multimodal large language models (MLLMs) on multi-image inputs. Existing MLLMs and benchmarks focus primarily on single-image input, yet real-world multimedia content often combines multiple images with accompanying text, which gives multi-image scenarios greater practical value. To fill this gap, the paper proposes a new benchmark, MIBench, that aims to comprehensively evaluate the fine-grained abilities of MLLMs in multi-image scenarios. MIBench divides multi-image abilities into three scenarios: Multi-Image Instruction (MII), Multimodal Knowledge-Seeking (MKS), and Multimodal In-Context Learning (MIC). It constructs 13 tasks with a total of 13K annotated samples. These tasks evaluate multi-image perception, comparison, and reasoning, and also examine how well models use external multimodal knowledge and learn in context.

The main contributions of the paper include:

1. Proposing MIBench, described as the first large-scale and comprehensive benchmark for multi-image abilities, covering three scenarios and 13 tasks.
2. Revealing, through evaluation on MIBench, significant challenges that existing MLLMs (especially open-source models) face in multi-image scenarios, particularly in fine-grained perception and multi-image reasoning.
3. Showing that existing MLLMs perform poorly in the multimodal knowledge-seeking scenario and that multimodal in-context learning still has much room for improvement.

In summary, MIBench provides a tool and reference point for evaluating and improving the abilities of MLLMs in multi-image scenarios.

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

MMBench: Is Your Multi-modal Model an All-around Player?

MileBench: Benchmarking MLLMs in Long Context

A Survey on Benchmarks of Multimodal Large Language Models

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

A Survey on Multimodal Benchmarks: In the Era of Large AI Models

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark

MOSABench: Multi-Object Sentiment Analysis Benchmark for Evaluating Multimodal Large Language Models Understanding of Complex Image

CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs