Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

Bingchen Zhao, Yongshuo Zong, Letian Zhang, Timothy Hospedales
2024-06-19
Abstract: The advancement of large language models (LLMs) has significantly broadened the scope of applications in natural language processing, with multi-modal LLMs extending these capabilities to integrate and interpret visual data. However, existing benchmarks for visual language models (VLMs) predominantly focus on single-image inputs, neglecting the crucial aspect of multi-image understanding. In this paper, we introduce a Multi-Image Relational Benchmark MIRB, designed to evaluate VLMs' ability to compare, analyze, and reason across multiple images. Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning. Through a comprehensive evaluation of a wide range of open-source and closed-source models, we demonstrate that while open-source VLMs were shown to approach the performance of GPT-4V in single-image tasks, a significant performance gap remains in multi-image reasoning tasks. Our findings also reveal that even the state-of-the-art GPT-4V model struggles with our benchmark, underscoring the need for further research and development in this area. We believe our contribution of MIRB could serve as a testbed for developing the next-generation multi-modal models.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses the lack of evaluation benchmarks for multi-image understanding in visual language models (VLMs). Large language models (LLMs) have made significant progress in natural language processing, and multi-modal LLMs have extended these capabilities to integrate and interpret visual data. Existing VLM benchmarks, however, mainly use single-image inputs and neglect multi-image understanding.

The paper therefore proposes the Multi-Image Relational Benchmark (MIRB) to evaluate the ability of VLMs to compare, analyze, and reason across multiple images. MIRB covers four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning. A comprehensive evaluation of open-source and closed-source models shows that, although open-source VLMs have been shown to approach GPT-4V on single-image tasks, a significant performance gap remains on multi-image reasoning tasks. Even GPT-4V, the state-of-the-art closed-source model evaluated, struggles on MIRB, which underscores the need for further research on multi-image reasoning.

In summary, the main contributions of the paper are:
1. Proposing MIRB, a comprehensive benchmark that evaluates different aspects of multi-image understanding and fills an important gap in the evaluation of visual language models.
2. Conducting a detailed evaluation of both open-source and closed-source models, highlighting current limitations and performance differences in multi-image reasoning.
3. Identifying the challenges and potential areas for improvement in developing visual language models that can handle and reason about multiple images, providing direction for future research and development.
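
To make the idea of reporting results per category concrete, here is a minimal Python sketch of how accuracy on a four-category benchmark like this might be tallied. It is an illustration under stated assumptions only: the record fields (`category`, `images`, `prediction`, `answer`), the exact-match scoring rule, and the sample data are all hypothetical and are not taken from MIRB, whose data format and scoring protocol are not described in this summary.

```python
from collections import defaultdict

# The four category names come from the abstract; everything else is hypothetical.
CATEGORIES = (
    "perception",
    "visual world knowledge",
    "reasoning",
    "multi-hop reasoning",
)


def per_category_accuracy(records):
    """Return {category: accuracy} using case-insensitive exact match.

    Exact match is an assumed scoring rule chosen for simplicity.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        total[category] += 1
        predicted = record["prediction"].strip().lower()
        expected = record["answer"].strip().lower()
        if predicted == expected:
            correct[category] += 1
    # Only report categories that actually have questions.
    return {c: correct[c] / total[c] for c in CATEGORIES if total[c]}


# Made-up example records; each question refers to more than one image.
records = [
    {"category": "perception", "images": ["a.png", "b.png"],
     "prediction": "2", "answer": "2"},
    {"category": "perception", "images": ["c.png", "d.png"],
     "prediction": "left", "answer": "right"},
    {"category": "multi-hop reasoning", "images": ["e.png", "f.png", "g.png"],
     "prediction": "Paris", "answer": "paris"},
]

scores = per_category_accuracy(records)
for category, accuracy in scores.items():
    print(f"{category}: {accuracy:.2f}")
```

Reporting one score per category, rather than a single pooled number, is what lets a gap in one area (for example, multi-image reasoning) stay visible even when another area looks strong.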