MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

Hang Hua, Yunlong Tang, Ziyun Zeng, Liangliang Cao, Zhengyuan Yang, Hangfeng He, Chenliang Xu, Jiebo Luo
2024-10-13
Abstract: The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling more sophisticated and accurate integration of visual and textual information across various tasks, including image and video captioning, visual question answering, and cross-modal retrieval. Despite VLMs' superior capabilities, researchers lack a comprehensive understanding of their compositionality -- the ability to understand and produce novel combinations of known visual and textual components. Prior benchmarks provide only a relatively rough compositionality evaluation from the perspectives of objects, relations, and attributes while neglecting deeper reasoning about object interactions, counting, and complex compositions. However, compositionality is a critical ability that facilitates coherent reasoning and understanding across modalities for VLMs. To address this limitation, we propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating VLMs' compositionality. Our proposed benchmark serves as a complement to these earlier works. With MMCOMPOSITION, we can quantify and explore the compositionality of the mainstream VLMs. Surprisingly, we find GPT-4o's compositionality inferior to the best open-source model, and we analyze the underlying reasons. Our experimental analysis reveals the limitations of VLMs in fine-grained compositional perception and reasoning, and points to areas for improvement in VLM design and training. Resources available at: https://hanghuacs.github.io/MMComposition/
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the limited compositionality of current vision-language models (VLMs): their ability to understand and generate novel combinations of known visual and textual components. Although existing large-scale VLMs perform well on a variety of tasks, such as image and video captioning, visual question answering, and cross-modal retrieval, they still show clear weaknesses in fine-grained compositional understanding, especially in object counting, complex scene understanding, and object interaction. These weaknesses reveal a gap in compositional ability between humans and models, and compositional ability is crucial for more complex tasks such as image captioning, visual question answering, and scene understanding.
To evaluate this ability more comprehensively, the authors propose a new benchmark, **MMCOMPOSITION**, a high-quality human-annotated benchmark that evaluates pre-trained VLMs along multiple dimensions (such as compositional perception, reasoning, and probing). MMCOMPOSITION contains 13 categories of questions, covering single-image and multi-image scenarios as well as task formats such as single-choice and multiple-choice questions, and so provides a more comprehensive and in-depth evaluation framework than previous benchmarks. With this benchmark, the authors not only reveal the limitations of current state-of-the-art VLMs in compositional understanding but also analyze the key factors that affect a model's compositional ability, including visual encoder design, language decoder size, and the amount of training data.
The study finds that even advanced models such as GPT-4o perform poorly on tasks requiring detailed compositional reasoning, indicating that future research and development need to further strengthen the compositional ability of these models.
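As an illustration of how a benchmark with several question categories and both single-choice and multiple-choice items might be scored, the sketch below computes per-category accuracy. It is a minimal sketch under stated assumptions, not the authors' evaluation protocol: the record fields (`category`, `answer`, `prediction`), the function name `score_by_category`, the exact-match rule for multiple-choice items, and the sample records are all hypothetical.

```python
from collections import defaultdict


def score_by_category(items):
    """Return accuracy per category.

    Each item is a dict with hypothetical keys:
      'category'   - name of the question category
      'answer'     - set of correct option letters
      'prediction' - set of option letters chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        category = item["category"]
        total[category] += 1
        # Assumed exact-match rule: an item counts as correct only if the
        # predicted option set equals the answer set, so a partially correct
        # multiple-choice selection scores zero.
        if set(item["prediction"]) == set(item["answer"]):
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Made-up records for illustration only; they are not benchmark data.
items = [
    {"category": "counting", "answer": {"B"}, "prediction": {"B"}},
    {"category": "counting", "answer": {"A"}, "prediction": {"C"}},
    {"category": "object_interaction", "answer": {"A", "D"}, "prediction": {"A"}},
    {"category": "object_interaction", "answer": {"A", "D"}, "prediction": {"A", "D"}},
]

scores = score_by_category(items)
print(scores)  # {'counting': 0.5, 'object_interaction': 0.5}
```

Reporting accuracy per category rather than a single aggregate makes weaknesses in specific skills, such as counting, visible even when overall accuracy looks acceptable.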