Pre-trained Vision-Language Models Learn Discoverable Visual Concepts

Yuan Zang, Tian Yun, Hao Tan, Trung Bui, Chen Sun

2024-04-19

Abstract: Do vision-language models (VLMs) pre-trained to caption an image of a "durian" learn visual concepts such as "brown" (color) and "spiky" (texture) at the same time? We aim to answer this question, as visual concepts learned "for free" would enable wide applications such as neuro-symbolic reasoning or human-interpretable object classification. We assume that the visual concepts, if captured by pre-trained VLMs, can be extracted by their vision-language interface with text-based concept prompts. We observe that recent works prompting VLMs with concepts often differ in their strategies to define and evaluate the visual concepts, leading to conflicting conclusions. We propose a new concept definition strategy based on two observations: first, certain concept prompts include shortcuts that recognize correct concepts for wrong reasons; second, multimodal information (e.g., visual discriminativeness and textual knowledge) should be leveraged when selecting the concepts. Our proposed concept discovery and learning (CDL) framework is thus designed to identify a diverse list of generic visual concepts (e.g., "spiky" as opposed to "spiky durian"), which are ranked and selected based on visual and language mutual information. We carefully design quantitative and human evaluations of the discovered concepts on six diverse visual recognition datasets, which confirm that pre-trained VLMs do learn visual concepts that provide accurate and thorough descriptions for the recognized objects. All code and models are publicly released.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper asks whether vision-language models (VLMs), after contrastive pre-training on image-text pairs, learn visual concepts (such as color, shape, and texture) without being explicitly trained on them. Specifically, the authors test whether such concepts can be extracted directly through a model's vision-language interface using text-based concept prompts. This requires finding appropriate concept prompts and evaluating the quality of the extracted concepts. If pre-trained VLMs have indeed learned these concepts, the concepts could support applications such as neuro-symbolic reasoning or human-interpretable object classification. To answer the question, the authors propose the Concept Discovery and Learning (CDL) framework, which identifies a set of generic, visually salient concepts and selects them using multimodal information (e.g., visual discriminativeness and textual knowledge). They also design quantitative and human evaluations that measure the precision and thoroughness of the discovered concepts on six visual recognition datasets. The results confirm that pre-trained VLMs do learn visual concepts that provide accurate and thorough descriptions of the recognized objects.
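
The idea of ranking concepts by how informative they are about the object class can be illustrated with a small sketch. This is not the authors' implementation: the summary above does not specify how CDL computes visual and language mutual information, so the example assumes a simplified setting in which each concept has already been reduced to a binary "present/absent" indicator per image (for instance, by thresholding a VLM image-text similarity score). The function names `mutual_information` and `rank_concepts` and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(concept_present, labels):
    """Mutual information (in nats) between a binary concept indicator
    and the class labels, estimated from empirical counts."""
    n = len(labels)
    joint = Counter(zip(concept_present, labels))
    concept_counts = Counter(concept_present)
    label_counts = Counter(labels)
    mi = 0.0
    for (c, y), count in joint.items():
        p_cy = count / n
        p_c = concept_counts[c] / n
        p_y = label_counts[y] / n
        mi += p_cy * math.log(p_cy / (p_c * p_y))
    return mi


def rank_concepts(activations, labels, top_k):
    """Return the top_k concept names, most informative first."""
    scores = {
        name: mutual_information(present, labels)
        for name, present in activations.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data (assumed, for illustration only): four images, two classes.
labels = ["durian", "durian", "apple", "apple"]
activations = {
    "spiky": [1, 1, 0, 0],  # separates the two classes perfectly
    "round": [1, 1, 1, 1],  # present everywhere, so uninformative
    "brown": [1, 0, 0, 0],  # only partially informative
}

ranked = rank_concepts(activations, labels, top_k=2)
print(ranked)
```

In this toy setting a concept that fires on every image scores zero, while a concept that cleanly separates the classes scores highest, which matches the intuition behind selecting discriminative, generic concepts such as "spiky".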

Language-Informed Visual Concept Learning

The Neglected Tails in Vision-Language Models

Vision-Language Models for Vision Tasks: A Survey

Can Language Models Understand Physical Concepts?

An Introduction to Vision-Language Modeling

Understanding Visual Concepts Across Models

Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends

MyVLM: Personalizing VLMs for User-Specific Queries

A Vision Check-up for Language Models

Visual Concept-Metaconcept Learning

Do Pre-trained Vision-Language Models Encode Object States?

VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance

Visually-Augmented Language Modeling

Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts?

Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?

If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions

Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies

Towards Better Vision-Inspired Vision-Language Models

A Survey of Vision-Language Pre-Trained Models