Abstract: In recent years, large visual-language (V+L) models have achieved great success in various downstream tasks. However, it is not well studied whether these models have a conceptual grasp of the visual content. In this work, we focus on the conceptual understanding of these large V+L models. To facilitate this study, we propose novel benchmarking datasets for probing three different aspects of content understanding: 1) \textit{relations}, 2) \textit{composition}, and 3) \textit{context}. Our probes are grounded in cognitive science and help assess whether a V+L model can, for example, determine that snow garnished with a man is implausible, or identify beach furniture by knowing it is located on a beach. We experiment with many recent state-of-the-art V+L models and observe that these models mostly \textit{fail to demonstrate} a conceptual understanding. This study reveals several interesting insights, such as that \textit{cross-attention} helps in learning conceptual understanding, and that CNNs are better with \textit{texture and patterns}, while Transformers are better at \textit{color and shape}. We further utilize some of these insights and investigate a \textit{simple finetuning technique} that rewards the three conceptual understanding measures, with promising initial results. The proposed benchmarks will drive the community to delve deeper into conceptual understanding and foster advancements in the capabilities of large V+L models. The code and dataset are available at: \url{https://tinyurl.com/vlm-robustness}

Is the Red Square Big? MALeViC: Modeling Adjectives Leveraging Visual Contexts

In-Context Compositional Generalization for Large Vision-Language Models

The Scenario Refiner: Grounding subjects in images at the morphological level

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

Superlatives in Context: Modeling the Implicit Semantics of Superlatives

A Large-Scale Multilingual Study of Visual Constraints on Linguistic Selection of Descriptions

Probing Conceptual Understanding of Large Visual-Language Models

GeoMeter: Probing Depth and Height Perception of Large Visual-Language Models

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models

Language-Image Models with 3D Understanding

Matryoshka Multimodal Models

Efficient Large Multi-modal Models via Visual Context Compression

MAGNIFICo: Evaluating the In-Context Learning Ability of Large Language Models to Generalize to Novel Interpretations

Naming, Describing, and Quantifying Visual Objects in Humans and LLMs

Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset

Exploring Perceptual Limitation of Multimodal Large Language Models

Cabbage Sweeter than Cake? Analysing the Potential of Large Language Models for Learning Conceptual Spaces

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

Understanding Visual Concepts Across Models

Probing Large Language Models for Scalar Adjective Lexical Semantics and Scalar Diversity Pragmatics

Order Matters: Exploring Order Sensitivity in Multimodal Large Language Models