Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

Bingchen Zhao, Yongshuo Zong, Letian Zhang, Timothy Hospedales

2024-06-19

Abstract: The advancement of large language models (LLMs) has significantly broadened the scope of applications in natural language processing, with multi-modal LLMs extending these capabilities to integrate and interpret visual data. However, existing benchmarks for visual language models (VLMs) predominantly focus on single-image inputs, neglecting the crucial aspect of multi-image understanding. In this paper, we introduce a Multi-Image Relational Benchmark (MIRB), designed to evaluate VLMs' ability to compare, analyze, and reason across multiple images. Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning. Through a comprehensive evaluation of a wide range of open-source and closed-source models, we demonstrate that while open-source VLMs were shown to approach the performance of GPT-4V in single-image tasks, a significant performance gap remains in multi-image reasoning tasks. Our findings also reveal that even the state-of-the-art GPT-4V model struggles with our benchmark, underscoring the need for further research and development in this area. We believe our contribution of MIRB could serve as a testbed for developing the next-generation multi-modal models.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

The paper addresses the lack of evaluation benchmarks for multi-image understanding in visual language models (VLMs). Large language models (LLMs) have made significant progress in natural language processing, and multimodal LLMs extend these capabilities to integrate and interpret visual data, but existing VLM benchmarks focus mainly on single-image inputs and neglect multi-image understanding. The paper therefore proposes the Multi-Image Relational Benchmark (MIRB) to evaluate how well VLMs compare, analyze, and reason across multiple images. MIRB covers four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning. A comprehensive evaluation of a range of open-source and closed-source models shows that, although open-source VLMs approach GPT-4V on single-image tasks, a significant performance gap remains on multi-image reasoning tasks. Even the state-of-the-art closed-source model GPT-4V struggles on MIRB, which underscores the need for further research on multi-image reasoning.

In summary, the main contributions of the paper are:

1. Proposing MIRB, a comprehensive benchmark that evaluates different aspects of multi-image understanding and fills an important gap in the evaluation of visual language models.
2. Conducting a detailed evaluation of both open-source and closed-source models, highlighting current limitations and performance differences in multi-image reasoning.
3. Identifying the challenges and potential areas for improvement in developing visual language models that can handle and reason about multiple images, providing direction for future research and development.
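To make the four-category structure concrete, the sketch below shows one way per-category results for a single model could be aggregated. It is illustrative only: the record keys (`category`, `prediction`, `answer`), the function name `per_category_accuracy`, and the exact-match scoring rule are assumptions made for this example, not the paper's actual data format or metric. Only the four category names come from the abstract.

```python
from collections import defaultdict

# The four MIRB categories named in the abstract.
CATEGORIES = (
    "perception",
    "visual world knowledge",
    "reasoning",
    "multi-hop reasoning",
)


def per_category_accuracy(records):
    """Return {category: accuracy} for the categories present in `records`.

    Each record is a dict with hypothetical keys 'category', 'prediction',
    and 'answer'. Scoring is case-insensitive exact match, which is an
    assumption of this sketch rather than the benchmark's own metric.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        total[category] += 1
        if record["prediction"].strip().lower() == record["answer"].strip().lower():
            correct[category] += 1
    # Skip categories with no records to avoid dividing by zero.
    return {c: correct[c] / total[c] for c in CATEGORIES if total[c]}


# Made-up records, purely to show the call shape.
records = [
    {"category": "perception", "prediction": "3", "answer": "3"},
    {"category": "perception", "prediction": "cat", "answer": "dog"},
    {"category": "multi-hop reasoning", "prediction": " Paris ", "answer": "paris"},
]
scores = per_category_accuracy(records)
```

Reporting one score per category, rather than a single pooled number, keeps a weakness in one category (for example multi-hop reasoning) from being hidden by strength in another.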

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

MMBench: Is Your Multi-modal Model an All-around Player?

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

A Survey on Benchmarks of Multimodal Large Language Models

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking

Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

MileBench: Benchmarking MLLMs in Long Context

Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models

Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs

Are We on the Right Way for Evaluating Large Vision-Language Models?

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models