Abstract: In the field of vision-language contrastive learning, models such as CLIP capitalize on matched image-caption pairs as positive examples and leverage within-batch non-matching pairs as negatives. This approach has led to remarkable outcomes in zero-shot image classification, cross-modal retrieval, and linear evaluation tasks. We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining. Inspired by CutMix in vision categorization, we create semantically composite image-caption pairs by merging elements from two distinct instances in the dataset via a novel procedure. Our method fuses the captions and blends 50% of each image to form a new composite sample. This simple technique (termed CLIP-C for CLIP Compositions), devoid of any additional computational overhead or increase in model parameters, significantly improves zero-shot image classification and cross-modal retrieval. The benefits of CLIP-C are particularly pronounced in settings with relatively limited pretraining data.
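The composition step the abstract describes (fuse two captions, keep 50% of each image) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: images are plain nested lists of pixel values, and the names `center_half`, `compose_pair`, `make_batch`, the `compose_prob` value, and the choice of which image goes on which side are all assumptions made for the example. The center-half crop and the "and" conjunction follow the method summary on this page.

```python
import random


def center_half(image, axis):
    """Return the central half of a square image (a list of S rows of S pixels).

    axis="width" keeps every row but only the central S/2 columns;
    axis="height" keeps only the central S/2 rows.
    """
    size = len(image)
    start = size // 4
    stop = start + size // 2
    if axis == "width":
        return [row[start:stop] for row in image]
    return [list(row) for row in image[start:stop]]


def compose_pair(image_a, caption_a, image_b, caption_b, axis="width"):
    """Stitch the center halves of two images and join their captions with 'and'."""
    half_a = center_half(image_a, axis)
    half_b = center_half(image_b, axis)
    if axis == "width":
        # Side by side: left half from image_a, right half from image_b (assumed order).
        image = [row_a + row_b for row_a, row_b in zip(half_a, half_b)]
    else:
        # Stacked: top half from image_a, bottom half from image_b (assumed order).
        image = half_a + half_b
    return image, f"{caption_a} and {caption_b}"


def make_batch(dataset, batch_size, compose_prob=0.5, rng=random):
    """Sample a batch in which each item is an original pair or a composite.

    compose_prob is a hypothetical mixing rate chosen for illustration.
    """
    batch = []
    for _ in range(batch_size):
        if rng.random() < compose_prob:
            (img_a, cap_a), (img_b, cap_b) = rng.sample(dataset, 2)
            axis = rng.choice(["width", "height"])
            batch.append(compose_pair(img_a, cap_a, img_b, cap_b, axis))
        else:
            batch.append(rng.choice(dataset))
    return batch
```

Because each composite has the same resolution as an original image, it can be fed to the image encoder unchanged, which is consistent with the abstract's claim of no extra compute or parameters.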

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper asks how to improve model performance in vision-language contrastive learning by introducing semantically composite samples. Specifically, it proposes a method named CLIP-C, which creates new composite samples by merging two different instances in the dataset, thereby improving data efficiency during pretraining. The method significantly improves CLIP on zero-shot image classification and cross-modal retrieval, and the gains are largest when pretraining data is relatively limited.

### Method overview

1. **Background**:
   - **CLIP model**: CLIP is a dual-encoder model. One encoder extracts image features and the other extracts text features, and a projection function maps both into a shared embedding space. The model is trained with the InfoNCE loss, using matching image-text pairs as positive samples and non-matching pairs as negative samples.
   - **InfoNCE loss function**:

     \[ L_{I2T}=-\frac{1}{B} \sum_{i = 1}^{B} \log \frac{\exp\left(\frac{1}{\tau} \text{sim}(z_i^I, z_i^T)\right)}{\sum_{j = 1}^{B} \exp\left(\frac{1}{\tau} \text{sim}(z_i^I, z_j^T)\right)} \]

     \[ L_{T2I}=-\frac{1}{B} \sum_{i = 1}^{B} \log \frac{\exp\left(\frac{1}{\tau} \text{sim}(z_i^I, z_i^T)\right)}{\sum_{k = 1}^{B} \exp\left(\frac{1}{\tau} \text{sim}(z_k^I, z_i^T)\right)} \]

     \[ L=\frac{L_{I2T}+L_{T2I}}{2} \]

     Here \(z_i^I\) and \(z_i^T\) are the image and text embeddings of the \(i\)-th pair, \(\tau\) is the temperature, and \(B\) is the batch size.
2. **CLIP-C method**:
   - **Sample generation**: In each training step, CLIP-C samples a batch of size \(B\) from the dataset. Each paired instance is either an original sample or a composition of two different instances. Composite samples are generated as follows:
     - **Text**: Join the two original captions with "and", for example \([x_i^T, x_{i'}^T]\).
     - **Image**: Take the center half crop of each of the two images and stitch the crops together. For example, if the image resolution is \(S\times S\), take a center crop of size \((S/2\times S)\) or \((S\times S/2)\) from each image.
   - **Training process**: After generating the composite samples, CLIP-C extracts image and text features as usual and trains with the InfoNCE loss.

### Experimental results

1. **Zero-shot image classification**:
   - CLIP-C significantly outperforms CLIP across multiple downstream benchmarks. In particular, with CC3M pretraining, CLIP-C improves ImageNet Top-1 accuracy by 2% and outperforms CLIP on all 12 downstream datasets.
   - Even with the larger CC12M and RedCaps datasets, CLIP-C still outperforms CLIP on multiple benchmarks.
2. **Zero-shot cross-modal retrieval**:
   - On MS-COCO and Flickr30k, CLIP-C shows a significant improvement in both image-to-text and text-to-image retrieval. For example, with CC3M pretraining, the Top-1 accuracy of CLIP-C in image-to-text retrieval is more than 5% higher than that of CLIP.
3. **Linear probe evaluation**:
   - Using linear probes to evaluate the quality of the learned image features, CLIP-C performs well in multiple benchmark tests, especially in CC3
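
The symmetric InfoNCE loss given in the method overview can be checked numerically with a small sketch like the one below. It assumes that \(\text{sim}\) is cosine similarity and uses a temperature of 0.07; both are illustrative assumptions, and the function names are hypothetical rather than taken from the paper. Embeddings are plain lists of floats so the example needs only the standard library.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_softmax_at(logits, index):
    """Log of softmax(logits)[index], subtracting the max for numerical stability."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return logits[index] - log_sum


def info_nce(image_embeds, text_embeds, temperature=0.07):
    """Symmetric InfoNCE: the mean of the image-to-text and text-to-image losses.

    temperature=0.07 is an assumed value for illustration.
    """
    batch = len(image_embeds)
    # logits[i][j] = sim(z_i^I, z_j^T) / tau
    logits = [
        [cosine_similarity(z_img, z_txt) / temperature for z_txt in text_embeds]
        for z_img in image_embeds
    ]
    # L_I2T: for image i, the negatives are the other texts (row i).
    loss_i2t = -sum(log_softmax_at(logits[i], i) for i in range(batch)) / batch
    # L_T2I: for text i, the negatives are the other images (column i).
    loss_t2i = -sum(
        log_softmax_at([logits[k][i] for k in range(batch)], i) for i in range(batch)
    ) / batch
    return (loss_i2t + loss_t2i) / 2
```

When every embedding in the batch is identical, all logits are equal and the loss reduces to \(\log B\), which is a convenient sanity check. Composite samples need no change to this function: they enter the batch as ordinary image-text pairs.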

Semantic Compositions Enhance Vision-Language Contrastive Learning
