CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models

Fuwen Luo, Chi Chen, Zihao Wan, Zhaolu Kang, Qidong Yan, Yingjie Li, Xiaolong Wang, Siyu Wang, Ziyue Wang, Xiaoyue Mi, Peng Li, Ning Ma, Maosong Sun, Yang Liu
2024-06-05
Abstract: Multimodal large language models (MLLMs) have demonstrated promising results in a variety of tasks that combine vision and language. As these models become more integral to research and applications, conducting comprehensive evaluations of their capabilities has grown increasingly important. However, most existing benchmarks fail to consider that, in certain situations, images need to be interpreted within a broader context. In this work, we introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension. Our findings indicate that MLLMs consistently fall short of human performance on this benchmark. Further analysis confirms that these models struggle to effectively extract and utilize contextual information to improve their understanding of images. This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner. View our project website at https://thunlp-mt.github.io/CODIS.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses the lack of evaluation of context dependency in how multimodal large language models (MLLMs) understand images. Most existing benchmarks do not account for cases in which an image must be interpreted within a broader context. As a result, how well these models handle context-dependent visual understanding tasks has remained largely unmeasured.

### Main Contributions

1. **Emphasizing the Importance of Context Dependency**: The paper argues that MLLMs should be able to understand and interpret the same image differently in different contexts.
2. **Introducing a New Benchmark, CODIS**: CODIS is a benchmark designed specifically to evaluate the context-dependent visual understanding of MLLMs. Each image is accompanied by context in free-form text that the model must use to interpret it.
3. **Revealing Model Deficiencies**: The analysis shows that existing MLLMs have significant deficiencies in extracting and using contextual information, which leaves considerable room for improvement in context-dependent visual understanding.

### Specific Methods

1. **Dataset Construction**:
   - **Image Collection**: Images are collected manually and contain ambiguities that can only be resolved through external context.
   - **Design of Questions, Contexts, and Answers**: Questions, contexts, and answers are written manually for each image. Each image-question pair is given two different contexts, which lead to different interpretations and therefore different answers.
   - **Data Validation**: Five annotators participate in the data collection process to ensure the correctness and diversity of the data.
2. **Evaluation Methods**:
   - **Pairwise Accuracy (Acc_p)**: The model scores for a pair only when it answers both queries in that pair correctly.
   - **Single Query Accuracy (Acc_q)**: Each query is scored on its own, so the model earns credit for every individual query it answers correctly.

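
The two metrics described above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the `Query` class, its field names, and the exact-match check in `is_correct` are hypothetical stand-ins (the benchmark may judge answers differently), and the sample answers are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Query:
    """One query: an image-question pair combined with a single context.

    All field names here are hypothetical, chosen only for this sketch.
    """
    pair_id: str    # shared by the two queries built from one image-question pair
    context: str    # free-form text context
    expected: str   # reference answer under this context
    predicted: str  # model output


def is_correct(q: Query) -> bool:
    # Assumption: case-insensitive exact match stands in for the real judging step.
    return q.predicted.strip().lower() == q.expected.strip().lower()


def query_accuracy(queries: list[Query]) -> float:
    """Single query accuracy: fraction of individual queries answered correctly."""
    return sum(is_correct(q) for q in queries) / len(queries)


def pair_accuracy(queries: list[Query]) -> float:
    """Pairwise accuracy: fraction of pairs whose queries are all answered correctly."""
    by_pair: dict[str, list[bool]] = {}
    for q in queries:
        by_pair.setdefault(q.pair_id, []).append(is_correct(q))
    return sum(all(flags) for flags in by_pair.values()) / len(by_pair)


# Invented toy data: the model ignores the context in pair "p1" and gives
# the same answer twice, so it gets only one of the two queries right.
queries = [
    Query("p1", "context A", "left", "left"),
    Query("p1", "context B", "right", "left"),
    Query("p2", "context A", "yes", "yes"),
    Query("p2", "context B", "no", "no"),
]

acc_q = query_accuracy(queries)  # 3 of 4 queries correct
acc_p = pair_accuracy(queries)   # 1 of 2 pairs fully correct
print(f"Acc_q = {acc_q:.2f}, Acc_p = {acc_p:.2f}")
```

In this toy case a model that repeats one answer regardless of context still gets half the queries in a pair right but earns no pairwise credit, which is why pairwise accuracy is the stricter test of whether the context was actually used.
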
### Experimental Results

- **Overall Results**: Humans outperform all MLLMs across all categories, indicating substantial room for improvement in the context-dependent visual understanding of MLLMs.
- **API Models vs. Open-Source Models**: API models significantly outperform open-source models, with GPT-4V performing best.
- **Context Awareness**: MLLMs are weak at recognizing that different contexts call for different responses, which leads to a large gap between Acc_p and Acc_q.
- **Performance Across Different Categories**: Among the five categories, most MLLMs perform best in the relationship category and worst in the cultural background category.

### Conclusion

By introducing the CODIS benchmark, the paper emphasizes the importance of context dependency for multimodal large language models and reveals the deficiencies of existing models in this area. Future research should focus on improving the ability of MLLMs to understand images in a context-dependent way.