Abstract: In the field of vision-language contrastive learning, models such as CLIP capitalize on matched image-caption pairs as positive examples and leverage within-batch non-matching pairs as negatives. This approach has led to remarkable outcomes in zero-shot image classification, cross-modal retrieval, and linear evaluation tasks. We show that the zero-shot classification and retrieval capabilities of CLIP-like models can be improved significantly through the introduction of semantically composite examples during pretraining. Inspired by CutMix in vision categorization, we create semantically composite image-caption pairs by merging elements from two distinct instances in the dataset via a novel procedure. Our method fuses the captions and blends 50% of each image to form a new composite sample. This simple technique (termed CLIP-C for CLIP Compositions), devoid of any additional computational overhead or increase in model parameters, significantly improves zero-shot image classification and cross-modal retrieval. The benefits of CLIP-C are particularly pronounced in settings with relatively limited pretraining data.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
The paper asks how to improve vision-language contrastive learning by introducing semantically composite samples during pretraining. It proposes a method named CLIP-C, which creates new composite samples by merging two different instances in the dataset, thereby improving data efficiency during pretraining. The method significantly improves the performance of CLIP on zero-shot image classification and cross-modal retrieval, and the gains are largest when the amount of pretraining data is relatively limited.
### Method overview
1. **Background**:
- **CLIP model**: CLIP is a dual-encoder model: one encoder extracts image features and the other extracts text features, and the features are mapped into a shared embedding space through a projection function. The model is trained with the InfoNCE loss, using matching image-text pairs as positive samples and non-matching pairs as negative samples.
- **InfoNCE loss function**:
\[
L_{I2T}=-\frac{1}{B} \sum_{i = 1}^{B} \log \frac{\exp\left(\frac{1}{\tau} \text{sim}(z_i^I, z_i^T)\right)}{\sum_{j = 1}^{B} \exp\left(\frac{1}{\tau} \text{sim}(z_i^I, z_j^T)\right)}
\]
\[
L_{T2I}=-\frac{1}{B} \sum_{i = 1}^{B} \log \frac{\exp\left(\frac{1}{\tau} \text{sim}(z_i^I, z_i^T)\right)}{\sum_{k = 1}^{B} \exp\left(\frac{1}{\tau} \text{sim}(z_k^I, z_i^T)\right)}
\]
\[
L=\frac{L_{I2T}+L_{T2I}}{2}
\]
Here \(B\) is the batch size, \(\tau\) is the temperature, \(z_i^I\) and \(z_i^T\) are the embeddings of the \(i\)-th image and text, and \(\text{sim}\) is the similarity function (cosine similarity in CLIP).
2. **CLIP-C method**:
- **Sample generation**: In each training step, CLIP-C samples a batch of size \(B\) from the dataset. Each image-text pair in the batch is either an original sample or a composition of two different instances. Composite samples are generated as follows:
- **Text**: Join the two original captions with "and", written \([x_i^T, x_{i'}^T]\).
- **Image**: Take the central half crop of each of the two images and stitch the crops together. For example, if the image resolution is \(S\times S\), each crop is \((S/2\times S)\) or \((S\times S/2)\), so the stitched image is again \(S\times S\).
- **Training process**: After generating the composite samples, CLIP-C extracts image and text features as usual and trains with the InfoNCE loss.
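The composition step and the loss described above can be sketched in a few lines of dependency-free Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: images are plain lists of pixel rows, the function names (`center_half_crop`, `compose_pair`, `info_nce`) are hypothetical, `sim` is assumed to be cosine similarity, and the default `tau=0.07` is only a placeholder value. How often a batch entry is composed rather than kept as an original is not stated in this summary, so it is left out.

```python
import math


def center_half_crop(image, axis):
    """Central half crop of a square image stored as a list of pixel rows.

    axis=1 keeps every row and the middle half of the columns;
    axis=0 keeps the middle half of the rows and every column.
    """
    size = len(image)
    half = size // 2
    start = (size - half) // 2
    if axis == 1:
        return [row[start:start + half] for row in image]
    return [list(row) for row in image[start:start + half]]


def compose_pair(image_a, caption_a, image_b, caption_b, axis=1):
    """Stitch two central half crops and join the captions with "and"."""
    crop_a = center_half_crop(image_a, axis)
    crop_b = center_half_crop(image_b, axis)
    if axis == 1:
        # Side by side: left half from image_a, right half from image_b.
        image = [row_a + row_b for row_a, row_b in zip(crop_a, crop_b)]
    else:
        # Stacked: top half from image_a, bottom half from image_b.
        image = crop_a + crop_b
    return image, f"{caption_a} and {caption_b}"


def cosine(u, v):
    """Cosine similarity between two vectors (assumed choice of sim)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def info_nce(image_embs, text_embs, tau=0.07):
    """Symmetric InfoNCE: the mean of the image-to-text and text-to-image terms."""
    batch = len(image_embs)
    # logits[i][j] = sim(z_i^I, z_j^T) / tau
    logits = [[cosine(zi, zt) / tau for zt in text_embs] for zi in image_embs]
    i2t = -sum(logits[i][i] - log_sum_exp(logits[i]) for i in range(batch)) / batch
    t2i = -sum(
        logits[i][i] - log_sum_exp([logits[k][i] for k in range(batch)])
        for i in range(batch)
    ) / batch
    return (i2t + t2i) / 2
```

A composite produced by `compose_pair` is treated like any other image-caption pair: it is encoded by the two encoders, and the resulting embeddings enter `info_nce` alongside the rest of the batch. In practice the same logic would normally be written with tensor operations in a deep-learning framework; plain lists are used here only to keep the sketch self-contained.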
### Experimental results
1. **Zero-shot image classification**:
- Across multiple downstream benchmarks, CLIP-C significantly outperforms CLIP. In particular, when pretrained on CC3M, CLIP-C improves ImageNet Top-1 accuracy by 2% and outperforms CLIP on all 12 downstream datasets.
- Even on the data-rich CC12M and RedCaps datasets, CLIP-C still outperforms CLIP on multiple benchmarks.
2. **Zero-shot cross-modal retrieval**:
- On the MS-COCO and Flickr30k datasets, CLIP-C shows a significant improvement in both image-to-text and text-to-image retrieval. For example, when pretrained on CC3M, the Top-1 accuracy of CLIP-C in image-to-text retrieval is more than 5% higher than that of CLIP.
3. **Linear probe evaluation**:
- When linear probes are used to evaluate the quality of the learned image features, CLIP-C performs well on multiple benchmarks, especially in CC3