Abstract: Interleaved multimodal comprehension and generation, enabling models to produce and interpret both images and text in arbitrary sequences, has become a pivotal area in multimodal learning. Despite significant advancements, the evaluation of this capability remains insufficient. Existing benchmarks suffer from limitations in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased, lacking in reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, aimed at reducing bias and improving evaluation accuracy. Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs. Specifically, we evaluate eight LVLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. We believe MMIE will drive further advancements in the development of interleaved LVLMs. We publicly release our benchmark and code at https://mmie-bench.github.io/.
