Abstract: Several benchmarks have concluded that our best vision-language models (e.g., CLIP) are lacking in compositionality. Given an image, these benchmarks probe a model's ability to identify its associated caption amongst a set of compositional distractors. In response, a surge of recent proposals show improvements by finetuning CLIP with distractors as hard negatives. Our investigations reveal that these improvements have, in fact, been significantly overstated -- because existing benchmarks do not probe whether finetuned vision-language models remain invariant to hard positives. By curating an evaluation dataset with 112,382 hard negatives and hard positives, we uncover that including hard positives decreases CLIP's performance by 12.9%, while humans perform effortlessly at 99%. CLIP finetuned with hard negatives results in an even larger decrease, up to 38.7%. With this finding, we then produce a 1,775,259 image-text training set with both hard negative and hard positive captions. By training with both, we see improvements on existing benchmarks while simultaneously improving performance on hard positives, indicating a more robust improvement in compositionality. Our work suggests the need for future research to rigorously test and improve CLIP's understanding of semantic relationships between related "positive" concepts.

What problem does this paper attempt to address?

This paper addresses a weakness of current vision-language models (such as CLIP) on compositional tasks. Models fine-tuned with hard negatives score better on existing benchmarks, yet they still fail to capture subtle semantic relationships and become overly sensitive to hard positives, i.e., small caption edits that leave the meaning unchanged. Specifically:

1. **Limitations of existing work**: Existing research mainly fine-tunes models with hard negatives to improve their compositional capabilities. These methods do not check whether the models remain invariant to hard positives, which are semantics-preserving modifications of the original caption. For example, changing "A brown dog is grabbing a white frisbee" to "A brown dog is holding a white frisbee" should not affect how well the caption matches the image.
2. **Side effects of hard negative fine-tuning**: Existing hard negative fine-tuning methods improve performance on some benchmarks, but they also make the models overly sensitive to hard positives. When such a model encounters a semantics-preserving modification, it may wrongly lower its matching score, which hurts overall performance.
3. **Importance of introducing hard positives**: To evaluate compositional capabilities more comprehensively, the authors curate a new evaluation dataset with 112,382 hard negatives and hard positives. On this dataset, including hard positives lowers CLIP's performance by 12.9%, and CLIP fine-tuned with hard negatives drops by up to 38.7%, while humans reach 99% accuracy on the same task.
4. **Solution**: To counter the over-sensitivity introduced by hard negative fine-tuning, the authors fine-tune with both hard negatives and hard positives, using a training set of 1,775,259 image-text pairs. This improves results on existing benchmarks and, at the same time, makes the model more robust on hard positives.

In conclusion, the paper shows that existing benchmarks overstate the compositional gains of hard negative fine-tuning, and it proposes training with both hard negatives and hard positives to improve the models' understanding of semantic relationships, especially for hard positives.
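The difference between the usual hard negative test and a test that also includes hard positives can be illustrated with a small Python sketch. It assumes each test item comes with three image-text similarity scores (original caption, hard positive, hard negative) and that the stricter test counts an item as correct only when both positive captions outscore the hard negative. The `Example` container, the function names, and the toy scores are hypothetical and for illustration only; they are not taken from the paper's code or results.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Example:
    """Similarity scores for one image (hypothetical container)."""
    original: float       # score of the original caption
    hard_positive: float  # score of a meaning-preserving rewrite
    hard_negative: float  # score of a meaning-changing distractor


def standard_correct(ex: Example) -> bool:
    # Usual benchmark criterion: the original caption beats the hard negative.
    return ex.original > ex.hard_negative


def augmented_correct(ex: Example) -> bool:
    # Assumed stricter criterion: BOTH positive captions beat the hard negative.
    return ex.original > ex.hard_negative and ex.hard_positive > ex.hard_negative


def accuracy(examples: Iterable[Example], criterion: Callable[[Example], bool]) -> float:
    items = list(examples)
    if not items:
        raise ValueError("need at least one example")
    return sum(criterion(ex) for ex in items) / len(items)


# Toy scores, invented for illustration (not real CLIP outputs).
examples = [
    Example(original=0.31, hard_positive=0.30, hard_negative=0.25),
    Example(original=0.28, hard_positive=0.22, hard_negative=0.24),
    Example(original=0.20, hard_positive=0.27, hard_negative=0.26),
    Example(original=0.33, hard_positive=0.29, hard_negative=0.21),
]

print(accuracy(examples, standard_correct))   # 0.75
print(accuracy(examples, augmented_correct))  # 0.5
```

In the second toy item the model ranks the original caption above the distractor but ranks the paraphrase below it, so the item passes the standard test and fails the stricter one. This is the kind of over-sensitivity to hard positives that the summary describes.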

The Hard Positive Truth about Vision-Language Compositionality

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations

SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality

Semantic Compositions Enhance Vision-Language Contrastive Learning

COLA: A Benchmark for Compositional Text-to-image Retrieval

MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models

Does CLIP Bind Concepts? Probing Compositionality in Large Image Models

Interpretable Composition Attribution Enhancement for Visio-linguistic Compositional Understanding

BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval

Iterated Learning Improves Compositionality in Large Vision-Language Models

Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models

FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models

Language Plays a Pivotal Role in the Object-Attribute Compositional Generalization of CLIP

Natural Language Inference Improves Compositionality in Vision-Language Models

Modeling Caption Diversity in Contrastive Vision-Language Pretraining

Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition

ComCLIP: Training-Free Compositional Image and Text Matching

In-Context Learning Improves Compositional Understanding of Vision-Language Models