Abstract: While multimodal large language models (MLLMs) have demonstrated extraordinary vision-language understanding capabilities and shown potential to serve as general-purpose assistants, their abilities to solve instance-level visual-language problems beyond a single image warrant further exploration. In order to assess these unproven abilities of MLLMs, this paper proposes a new visual grounding task called multi-context visual grounding, which aims to localize instances of interest across multiple images based on open-ended text prompts. To facilitate this research, we meticulously construct a new dataset MC-Bench for benchmarking the visual grounding capabilities of MLLMs. MC-Bench features 2K high-quality and manually annotated samples, consisting of instance-level labeled image pairs and corresponding text prompts that indicate the target instances in the images. In total, there are three distinct styles of text prompts, covering 20 practical skills. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities. Our evaluation reveals a non-trivial performance gap between existing MLLMs and humans across all metrics. We also observe that existing MLLMs typically outperform foundation models without LLMs only on image-level metrics, and the specialist MLLMs trained on single images often struggle to generalize to multi-image scenarios. Moreover, a simple stepwise baseline integrating advanced MLLM and a detector can significantly surpass prior end-to-end MLLMs. We hope our MC-Bench and empirical findings can encourage the research community to further explore and enhance the untapped potentials of MLLMs in instance-level tasks, particularly in multi-image contexts. Project page: https://xuyunqiu.github.io/MC-Bench/

What problem does this paper attempt to address?

This paper addresses how to evaluate instance-level visual grounding from open-ended text prompts in multi-image scenarios. Existing multimodal large language models (MLLMs) perform well at vision-language understanding and have shown potential as general-purpose assistants, but their ability to handle instance-level vision-language problems beyond a single image has not been fully explored. To evaluate these unvalidated abilities, the paper proposes a new task, multi-context visual grounding, whose goal is to localize instances of interest across multiple images according to open-ended text prompts. To this end, the researchers constructed a new dataset, MC-Bench, for benchmarking MLLMs and foundation models with potential multi-context visual grounding capabilities.

### Main Contributions

1. **Proposing a New Task**: This is the first exploration of MLLMs in open-ended, multi-image, instance-level scenarios, and a practical multi-context visual grounding task is proposed.
2. **Constructing a New Dataset**: MC-Bench contains 2,000 manually annotated samples. Each sample consists of an image pair, a text prompt, and the corresponding instance-level labels. The diverse images and open-ended text prompts enable the evaluation of MLLMs along multiple dimensions.
3. **Benchmarking**: More than 20 related MLLMs and foundation models were benchmarked on MC-Bench, revealing a significant performance gap between existing MLLMs and humans. The paper also provides in-depth analysis intended to guide the development and improvement of MLLMs.

### Dataset Characteristics

- **Diversity**: MC-Bench collects 3,345 distinct images from multiple data sources, covering domains such as natural images, charts, document photos, artworks, and scientific illustrations.
- **Text Descriptions**: Three styles of text prompts (referring, comparison, and reasoning) were designed, covering 20 practical skills.
- **Instance-Level Annotations**: The dataset contains 3,202 language-guided bounding box annotations. The text prompt in each positive sample can indicate 1 to 17 instances, divided into 1 to 7 groups.

### Experimental Results

- **Performance Evaluation**: Model performance was evaluated using image-level and instance-level metrics. The results show that current MLLMs still have significant room for improvement. In particular, small-scale MLLMs (with no more than 7B parameters) perform comparably to foundation models on instance-level metrics, but performance improves significantly as model scale increases.
- **Baseline Comparison**: A simple stepwise baseline that combines an advanced MLLM with a detector significantly outperforms previous end-to-end MLLMs.
- **Human Evaluation**: A human evaluation with 3 volunteers provides a reference upper bound for MLLMs. Humans significantly outperform existing models on all metrics.

### Conclusion

Through MC-Bench and its empirical findings, the paper hopes to encourage the research community to further explore and enhance the untapped potential of MLLMs in instance-level tasks, especially in multi-image scenarios.
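
To make the sample structure and the two metric levels described above concrete, here is a minimal Python sketch. It is illustrative only: this summary does not give the paper's exact metric definitions, so the `Sample` record, the `image_level_correct` check, the greedy `matched_instances` count, and the 0.5 IoU threshold are all assumptions, not the MC-Bench evaluation code.

```python
from dataclasses import dataclass

# (x1, y1, x2, y2) corner format; an assumed convention, not taken from the paper.
Box = tuple[float, float, float, float]


@dataclass
class Sample:
    """Hypothetical record: one text prompt plus ground-truth boxes per image."""
    prompt: str
    boxes: tuple[list[Box], list[Box]]  # an empty list means no target in that image


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def image_level_correct(sample: Sample, preds: tuple[list[Box], list[Box]]) -> bool:
    """Assumed image-level check: predictions flag exactly the images with targets."""
    return all(bool(p) == bool(g) for p, g in zip(preds, sample.boxes))


def matched_instances(gt: list[Box], pred: list[Box], thresh: float = 0.5) -> int:
    """Assumed instance-level check: greedy one-to-one matching at an IoU threshold."""
    used: set[int] = set()
    hits = 0
    for g in gt:
        best_j, best_iou = None, thresh
        for j, p in enumerate(pred):
            score = iou(g, p)
            if j not in used and score >= best_iou:
                best_j, best_iou = j, score
        if best_j is not None:
            used.add(best_j)
            hits += 1
    return hits


# Usage: the target appears only in the first image of the pair.
sample = Sample("the mug that is missing in the other photo", ([(0, 0, 10, 10)], []))
preds = ([(1, 1, 10, 10)], [])
print(image_level_correct(sample, preds), matched_instances(sample.boxes[0], preds[0]))
```

Separating the two checks mirrors the distinction drawn in the results: a model can identify which image contains the target (image level) and still fail to place an accurate box on it (instance level).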

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision

Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs

MileBench: Benchmarking MLLMs in Long Context

VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

MMBench: Is Your Multi-modal Model an All-around Player?

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark

II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models

Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

A Survey on Benchmarks of Multimodal Large Language Models

MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs