Abstract: Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for downstream tasks. However, the lack of compositional diversity in contemporary image-text datasets limits the compositional reasoning ability of CLIP. We show that generating "hard" negative captions via in-context learning and synthesizing corresponding negative images with text-to-image generators offers a solution. We introduce a novel contrastive pre-training strategy that leverages these hard negative captions and images in an alternating fashion to train CLIP. We demonstrate that our method, named TripletCLIP, when applied to existing datasets such as CC3M and CC12M, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark on an equal computational budget, as well as improvements in zero-shot image classification and image retrieval. Our code, models, and data are available at: https://tripletclip.github.io

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper "TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives" addresses the weakness of contrastive image-text pre-training models such as CLIP in compositional reasoning. CLIP learns representations by maximizing the mutual information between the text and visual modalities, so the quality of its representations depends heavily on the training data. Because existing image-text datasets lack compositional diversity, CLIP performs poorly on tasks that depend on how words are composed. For example, it struggles to distinguish "bulb in the grass" from "grass in the bulb."

To solve this problem, the authors propose a new contrastive pre-training strategy called TripletCLIP. It improves CLIP's compositional reasoning by generating "hard" negative samples (samples that closely resemble the positive samples but differ in meaning) and using the negative captions and negative images in an alternating fashion during training. The specific steps are:

1. **Generating Hard Negative Samples**:
   - Using a large language model (LLM) with in-context learning to generate hard negative captions.
   - Using a pre-trained text-to-image generation model to synthesize the corresponding negative images.
2. **Introducing a Triplet Contrastive Loss Function**:
   - Adding a supervision term to the standard contrastive loss so that each negative image is closer to its own negative caption than to the positive caption.

The authors apply TripletCLIP to existing datasets such as CC3M and CC12M and report improved compositional reasoning, along with gains in zero-shot image classification and image retrieval. On the SugarCrepe benchmark, TripletCLIP achieves an absolute improvement of over 9% under an equal computational budget.

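
The triplet contrastive loss described above can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: it assumes the total loss is the sum of two symmetric terms, one where positive images are contrasted against batch captions plus the hard negative captions, and one where the roles of the positive and negative pairs are swapped. The function names, the temperature value of 0.07, and the toy embeddings are all hypothetical choices made for this example.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm, as CLIP does before computing similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def hard_negative_clip_loss(images, texts, hard_texts, temperature=0.07):
    """Contrastive loss where each image also sees hard negative captions.

    images[i] and texts[i] form a matched pair; hard_texts are extra
    distractor captions used only in the image-to-text direction.
    """
    images = [normalize(v) for v in images]
    texts = [normalize(v) for v in texts]
    hard_texts = [normalize(v) for v in hard_texts]
    n = len(images)
    loss = 0.0
    for i in range(n):
        # Image -> text: candidates are all batch captions plus all hard negatives.
        i2t = [dot(images[i], t) / temperature for t in texts + hard_texts]
        # Text -> image: candidates are the batch images only.
        t2i = [dot(texts[i], im) / temperature for im in images]
        loss += cross_entropy(i2t, i) + cross_entropy(t2i, i)
    return loss / (2 * n)


def triplet_contrastive_loss(images, texts, neg_images, neg_texts, temperature=0.07):
    """Assumed form: sum of two hard-negative terms with the roles swapped."""
    # Positive images must prefer positive captions over the negative captions.
    forward = hard_negative_clip_loss(images, texts, neg_texts, temperature)
    # Negative images must prefer negative captions over the positive captions.
    backward = hard_negative_clip_loss(neg_images, neg_texts, texts, temperature)
    return forward + backward


# Toy embeddings (hypothetical): positives and negatives occupy different axes.
pos = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
neg = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

aligned = triplet_contrastive_loss(pos, pos, neg, neg)
# Here the "negative images" are embedded exactly like the positive images,
# so they sit closer to the positive captions than to their own captions.
confused = triplet_contrastive_loss(pos, pos, pos, neg)
print(aligned, confused)
```

In this toy setup the second term is what penalizes the confused case: when a negative image is embedded like its positive counterpart, it matches the positive caption better than its own negative caption, and the loss rises accordingly.
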
### Summary

The main contributions of the paper are:

- A new CLIP pre-training strategy that improves compositional reasoning by generating and using hard negative samples in both modalities (captions and images).
- A triplet contrastive loss function that makes use of these hard negative samples.
- Experiments showing that TripletCLIP improves several downstream tasks, most notably compositional reasoning, along with zero-shot image classification and image retrieval.

Together, these contributions offer a data-driven approach to improving the compositional reasoning of multimodal models.

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
