Abstract: Compositional reasoning is a hallmark of human visual intelligence. Yet, despite their size, large vision-language models struggle to represent simple compositions that combine objects with their attributes. To measure this lack of compositional capability, we design Cola, a text-to-image retrieval benchmark to Compose Objects Localized with Attributes. To solve Cola, a model must retrieve images with the correct configuration of attributes and objects and avoid choosing a distractor image with the same objects and attributes but in the wrong configuration. Cola contains about 1.2k composed queries of 168 objects and 197 attributes on around 30K images. Our human evaluation finds that Cola is 83.33% accurate, similar to contemporary compositionality benchmarks. Using Cola as a testbed, we explore empirical modeling designs to adapt pre-trained vision-language models to reason compositionally. We explore 6 adaptation strategies on 2 seminal vision-language models, using two compositionality-centric test benchmarks: Cola and CREPE. We find the optimal adaptation strategy is to train a multi-modal attention layer that jointly attends over the frozen pre-trained image and language features. Surprisingly, training multimodal layers on CLIP performs better than tuning a larger FLAVA model with already pre-trained multimodal layers. Furthermore, our adaptation strategy improves CLIP and FLAVA to comparable levels, suggesting that training multimodal layers using contrastive attribute-object data is key, as opposed to using them pre-trained. Lastly, we show that Cola is harder than a closely related contemporary benchmark, CREPE, since simpler fine-tuning strategies without multimodal layers suffice on CREPE but not on Cola. However, we still see a significant gap between our best adaptation and human accuracy, suggesting considerable room for further research.
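
The retrieval setup described in the abstract can be illustrated with a minimal sketch. This is not the paper's evaluation code or metric: it assumes each test case is a hypothetical triple of a query, a correct image, and a distractor image, and it stands in for images with short text descriptions so the example stays self-contained. The function names and the two toy scorers are illustrative assumptions. The sketch shows why a scorer that ignores how attributes bind to objects cannot separate the correct image from a distractor with the same words in a different configuration.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical test case: (query, correct image, distractor image).
# Assumption: images are stood in for by text descriptions to keep this runnable.
Example = Tuple[str, str, str]


def pairwise_retrieval_accuracy(
    examples: Sequence[Example],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of cases where the correct image strictly outscores the distractor."""
    if not examples:
        raise ValueError("examples must be non-empty")
    hits = sum(1 for query, pos, neg in examples if score(query, pos) > score(query, neg))
    return hits / len(examples)


def bag_of_words_score(query: str, image_desc: str) -> float:
    """Toy scorer that ignores word order: counts shared words."""
    return float(len(set(query.split()) & set(image_desc.split())))


def bigram_score(query: str, image_desc: str) -> float:
    """Toy scorer that is sensitive to attribute-object binding: counts shared adjacent word pairs."""
    def bigrams(text: str) -> set:
        words = text.split()
        return set(zip(words, words[1:]))

    return float(len(bigrams(query) & bigrams(image_desc)))


# Distractors contain the same objects and attributes, bound differently.
EXAMPLES = [
    ("red cup white table", "red cup white table", "white cup red table"),
    ("black dog brown couch", "black dog brown couch", "brown dog black couch"),
]

bow_accuracy = pairwise_retrieval_accuracy(EXAMPLES, bag_of_words_score)
bigram_accuracy = pairwise_retrieval_accuracy(EXAMPLES, bigram_score)
```

In this toy setting the order-insensitive scorer ties on every pair and so never ranks the correct image first, while the binding-aware scorer does. A real evaluation would replace the toy scorers with image-text similarity from a vision-language model.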

Compositional Kronecker Context Optimization for Vision-Language Models

In-Context Compositional Generalization for Large Vision-Language Models

Learning to Prompt for Vision-Language Models

Visual-Language Prompt Tuning with Knowledge-guided Context Optimization

Improving Knowledge Graph Representation Learning by Structure Contextual Pre-training

LoCoCo: Dropping In Convolutions for Long Context Compression

Towards Compatible Fine-tuning for Vision-Language Model Updates

Optimization of Prompt Learning via Multi-Knowledge Representation for Vision-Language Models

Efficient Large Multi-modal Models via Visual Context Compression

Exploring Diverse In-Context Configurations for Image Captioning

In-Context Learning Improves Compositional Understanding of Vision-Language Models

Conceptual Codebook Learning for Vision-Language Models

Modal interaction-enhanced prompt learning by transformer decoder for vision-language models

Iterated Learning Improves Compositionality in Large Vision-Language Models

COLA: A Benchmark for Compositional Text-to-image Retrieval

IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning

Learning Visual Composition through Improved Semantic Guidance

Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval

Conditional Kronecker Batch Normalization for Compositional Reasoning

Towards Understanding the Relationship between In-context Learning and Compositional Generalization