Abstract: The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling more sophisticated and accurate integration of visual and textual information across various tasks, including image and video captioning, visual question answering, and cross-modal retrieval. Despite VLMs' superior capabilities, researchers lack a comprehensive understanding of their compositionality -- the ability to understand and produce novel combinations of known visual and textual components. Prior benchmarks provide only a relatively rough compositionality evaluation from the perspectives of objects, relations, and attributes while neglecting deeper reasoning about object interactions, counting, and complex compositions. However, compositionality is a critical ability that facilitates coherent reasoning and understanding across modalities for VLMs. To address this limitation, we propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating VLMs' compositionality. Our proposed benchmark serves as a complement to these earlier works. With MMCOMPOSITION, we can quantify and explore the compositionality of the mainstream VLMs. Surprisingly, we find GPT-4o's compositionality inferior to the best open-source model, and we analyze the underlying reasons. Our experimental analysis reveals the limitations of VLMs in fine-grained compositional perception and reasoning, and points to areas for improvement in VLM design and training. Resources available at: https://hanghuacs.github.io/MMComposition/

What problem does this paper attempt to address?

This paper addresses the shortcomings of current vision-language models (VLMs) in understanding and generating complex combinations of visual and textual information. Although existing large VLMs perform well on a variety of tasks, such as image and video captioning, visual question answering, and cross-modal retrieval, they still show clear weaknesses in fine-grained compositional understanding, especially in object counting, complex scene understanding, and object interaction. These weaknesses reveal a gap in compositional ability between humans and models, and that ability underpins coherent reasoning and understanding across modalities. To evaluate it more comprehensively, the authors propose a new benchmark, **MMCOMPOSITION**, a high-quality, human-annotated benchmark that evaluates pre-trained VLMs along several dimensions (compositional perception, reasoning, and probing). MMCOMPOSITION contains 13 categories of questions, covering both single-image and multi-image scenarios as well as single-choice and multiple-choice question formats, and so provides a more comprehensive and in-depth evaluation framework than previous benchmarks. With this benchmark, the authors not only expose the limitations of current state-of-the-art VLMs in compositional understanding but also analyze the key factors that affect a model's compositional ability, including visual encoder design, language decoder size, and the amount of training data.
The study finds that even advanced models such as GPT-4o perform poorly on tasks that require fine-grained compositional reasoning, indicating that future research and development needs to further strengthen the compositional ability of these models.
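
To make the benchmark layout described above more concrete, the following is a minimal sketch of how per-category accuracy could be computed over a mix of single-choice and multiple-choice questions. It is not the authors' evaluation code: the record fields (`category`, `predicted`, `gold`), the category names, and the exact-match scoring rule for multiple-choice questions are all assumptions made for illustration.

```python
from collections import defaultdict


def score_item(predicted, gold):
    """Return 1.0 if the predicted option set equals the gold set, else 0.0.

    Assumption: multiple-choice questions are scored by exact set match,
    so a partially correct selection earns no credit. Single-choice
    questions are the special case of a one-element set.
    """
    return 1.0 if set(predicted) == set(gold) else 0.0


def accuracy_by_category(items):
    """Average the per-item scores within each question category."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for item in items:
        totals[item["category"]] += score_item(item["predicted"], item["gold"])
        counts[item["category"]] += 1
    return {category: totals[category] / counts[category] for category in counts}


# Hypothetical records; the category names are illustrative only.
items = [
    {"category": "counting", "predicted": ["B"], "gold": ["B"]},
    {"category": "counting", "predicted": ["A"], "gold": ["C"]},
    {"category": "object_interaction", "predicted": ["A", "C"], "gold": ["A", "C"]},
    {"category": "object_interaction", "predicted": ["A"], "gold": ["A", "C"]},
]

print(accuracy_by_category(items))
```

Reporting accuracy per category rather than as a single aggregate is what lets a benchmark of this kind localize weaknesses, for example in counting versus object interaction.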

MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

Compositional Substitutivity of Visual Reasoning for Visual Question Answering

VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?

In-Context Compositional Generalization for Large Vision-Language Models

Visualizing and Understanding Neural Models in NLP

Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language

Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition

An Examination of the Compositionality of Large Generative Vision-Language Models

FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

Model Composition for Multimodal Large Language Models

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

Interpretable Composition Attribution Enhancement for Visio-linguistic Compositional Understanding

COLA: A Benchmark for Compositional Text-to-image Retrieval

Improving Context Understanding in Multimodal Large Language Models Via Multimodal Composition Learning

ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Prompting Large Vision-Language Models for Compositional Reasoning

In-Context Learning Improves Compositional Understanding of Vision-Language Models

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI