Abstract: Multimodal large language models (MLLMs) have demonstrated promising results in a variety of tasks that combine vision and language. As these models become more integral to research and applications, conducting comprehensive evaluations of their capabilities has grown increasingly important. However, most existing benchmarks fail to consider that, in certain situations, images need to be interpreted within a broader context. In this work, we introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension. Our findings indicate that MLLMs consistently fall short of human performance on this benchmark. Further analysis confirms that these models struggle to effectively extract and utilize contextual information to improve their understanding of images. This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner. The project website is at https://thunlp-mt.github.io/CODIS.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper addresses a gap in how Multimodal Large Language Models (MLLMs) are evaluated: existing benchmarks rarely test whether a model can interpret an image in light of its surrounding context. Some images can only be understood correctly within a broader context, and because most benchmarks ignore this, the weakness of current models on context-dependent visual understanding tasks has gone largely unmeasured.

### Main Contributions

1. **Emphasizing the Importance of Context Dependency**: The paper argues that MLLMs should be able to understand and interpret images differently depending on the context.
2. **Introducing a New Benchmark CODIS**: CODIS is a new benchmark specifically designed to evaluate MLLMs' context-dependent visual understanding capabilities. It supplies context as free-form text that the model must use to interpret the image.
3. **Revealing Model Deficiencies**: Through analysis, the paper finds that existing MLLMs have significant deficiencies in extracting and utilizing contextual information, which leaves considerable room for improvement in context-dependent visual understanding.

### Specific Methods

1. **Dataset Construction**:
   - **Image Collection**: Manually collect images that contain ambiguities that can only be resolved through external context.
   - **Design of Questions, Contexts, and Answers**: Manually write questions, contexts, and answers for each image. Each image-question pair is provided with two different contexts, leading to different interpretations and different answers.
   - **Data Validation**: Five annotators participate in the data collection process to ensure the correctness and diversity of the data.
2. **Evaluation Methods**:
   - **Pairwise Accuracy (Accp)**: A pair counts as correct only when the model answers both of its queries correctly.
   - **Single Query Accuracy (Accq)**: Each query is scored on its own, so the model earns credit for every individual query it answers correctly.
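
The two metrics described above can be illustrated with a minimal sketch. It assumes each image-question pair has already been graded into two booleans, one per context; the function name `codis_accuracies` and the input format are hypothetical choices for illustration, not taken from the paper's released code.

```python
from typing import List, Tuple


def codis_accuracies(pair_results: List[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (acc_p, acc_q) from per-pair correctness flags.

    Each tuple holds the graded outcome of the two queries that share one
    image and question but use different contexts (hypothetical format).
    """
    if not pair_results:
        raise ValueError("pair_results must not be empty")

    n_pairs = len(pair_results)
    # Pairwise accuracy: a pair counts only if both of its queries are correct.
    both_correct = sum(1 for first, second in pair_results if first and second)
    # Single query accuracy: every query is counted independently.
    queries_correct = sum(int(first) + int(second) for first, second in pair_results)

    acc_p = both_correct / n_pairs
    acc_q = queries_correct / (2 * n_pairs)
    return acc_p, acc_q


# Four pairs: one fully correct, two half correct, one fully wrong.
results = [(True, True), (True, False), (False, False), (True, False)]
acc_p, acc_q = codis_accuracies(results)
print(f"Acc_p={acc_p:.2f} Acc_q={acc_q:.2f}")  # prints "Acc_p=0.25 Acc_q=0.50"
```

In this toy input, half of the individual queries are answered correctly, yet only one pair in four is fully correct. A model that gives the same answer regardless of context can still collect single-query credit, but it cannot score on the pairwise metric, which is why the pairwise figure is the stricter of the two.
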
### Experimental Results

- **Overall Results**: Humans outperform all MLLMs across all categories, indicating significant room for improvement in MLLMs' context-dependent visual understanding.
- **API Models vs. Open-Source Models**: API models significantly outperform open-source models, with GPT-4V performing the best.
- **Context Awareness**: MLLMs are weak at recognizing that different contexts call for different responses, which produces a large gap between Accp and Accq.
- **Performance Across Different Categories**: Among the five categories, most MLLMs perform best in the relationship category and worst in the cultural background category.

### Conclusion

By introducing the CODIS benchmark, the paper emphasizes the importance of context dependency in multimodal large language models and reveals the deficiencies of existing models in this area. Future research should focus on how to improve MLLMs' capabilities in context-dependent visual understanding.

CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models
