TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Maitreya Patel, Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, Yezhou Yang
2024-11-05
Abstract: Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations. This makes the nature of the training data a significant factor in the efficacy of CLIP for downstream tasks. However, the lack of compositional diversity in contemporary image-text datasets limits the compositional reasoning ability of CLIP. We show that generating "hard" negative captions via in-context learning and synthesizing corresponding negative images with text-to-image generators offers a solution. We introduce a novel contrastive pre-training strategy that leverages these hard negative captions and images in an alternating fashion to train CLIP. We demonstrate that our method, named TripletCLIP, when applied to existing datasets such as CC3M and CC12M, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark on an equal computational budget, as well as improvements in zero-shot image classification and image retrieval. Our code, models, and data are available at: https://tripletclip.github.io
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

The paper "TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives" addresses the weakness of contrastive language-image pre-training models such as CLIP in compositional reasoning. CLIP learns representations by maximizing the mutual information between the text and visual modalities, but existing image-text datasets lack compositional diversity, so the model handles compositional tasks poorly. For example, CLIP struggles to distinguish "bulb in the grass" from "grass in the bulb."

To solve this problem, the authors propose a contrastive pre-training strategy called TripletCLIP. It improves CLIP's compositional reasoning by generating "hard" negative samples (samples that closely resemble the positives but differ in meaning) and using the negative captions and images in alternation during training. The specific steps are:

1. **Generating Hard Negative Samples**:
   - Using a large language model (LLM) with in-context learning to generate hard negative captions.
   - Using a pre-trained text-to-image generator to synthesize the corresponding negative images.
2. **Introducing a Triplet Contrastive Loss Function**:
   - Adding a supervision term to the standard contrastive loss so that each negative image is closer to its own negative caption than to the positive caption.

Applied to existing datasets such as CC3M and CC12M, TripletCLIP improves several downstream tasks, most of all compositional reasoning. On the SugarCrepe benchmark it achieves an absolute improvement of over 9% on an equal computational budget, along with gains in zero-shot image classification and image retrieval.
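The loss described in step 2 can be sketched roughly as follows. This is a minimal illustration in plain Python and not the authors' implementation. The function names, the temperature of 0.07, the use of every hard caption in the batch as an extra candidate, and the restriction to the image-to-text direction are all assumptions made for the sketch.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def log_softmax_at(logits, target):
    """Log-probability of the target index under a numerically stable softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[target] - lse


def hard_negative_loss(images, captions, hard_captions, temperature=0.07):
    """Image-to-text contrastive loss with hard captions as extra candidates.

    Image i must pick caption i out of all in-batch captions plus all
    hard negative captions (an assumed formulation for this sketch).
    """
    images = [normalize(v) for v in images]
    candidates = [normalize(v) for v in captions + hard_captions]
    total = 0.0
    for i, image in enumerate(images):
        logits = [dot(image, c) / temperature for c in candidates]
        total -= log_softmax_at(logits, i)
    return total / len(images)


def triplet_contrastive_loss(pos_images, pos_captions,
                             neg_images, neg_captions, temperature=0.07):
    """Sum of two terms with the positive and negative roles swapped.

    The first term pulls each positive image toward its caption and away
    from the hard negative caption. The second term does the same for the
    negative image, whose own caption is now the target and whose hard
    negative is the original positive caption.
    """
    forward = hard_negative_loss(pos_images, pos_captions, neg_captions, temperature)
    backward = hard_negative_loss(neg_images, neg_captions, pos_captions, temperature)
    return forward + backward
```

Called with lists of embedding vectors, for example `triplet_contrastive_loss(pos_images, pos_captions, neg_images, neg_captions)`, the function returns a scalar that is small when every image sits closest to its own caption and grows when an image sits closer to the opposite caption. A real training setup would compute the same quantity on batched tensors in a deep learning framework and would likely add the text-to-image direction.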
### Summary

The main contributions of the paper are:

- Introducing a new CLIP pre-training strategy that enhances the model's compositional reasoning ability by generating and using hard negative samples (both text and images).
- Proposing a triplet contrastive loss function that makes effective use of these hard negative samples.
- Showing experimentally that TripletCLIP performs well on multiple downstream tasks, especially compositional reasoning tasks.

Together, these contributions give a practical approach to improving the compositional reasoning ability of multimodal models.