Abstract: Contrastive Language Image Pretraining (CLIP) has received widespread attention, since its learned representations can be transferred well to various downstream tasks. During the training process of the CLIP model, the InfoNCE objective aligns positive image-text pairs and separates negative ones. We show an underlying representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerged within-modal anchors. Based on this understanding, in this paper, Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge. Further, Prototypical Back Translation (PBT) is proposed to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap. PBT also enables us to introduce additional external teachers with richer prior language knowledge. ProtoCLIP is trained with an online episodic training strategy, which allows it to be scaled up to unlimited amounts of data. We train ProtoCLIP on Conceptual Captions and achieve a +5.81% ImageNet linear probing improvement and a +2.01% ImageNet zero-shot classification improvement. On the larger YFCC-15M dataset, ProtoCLIP matches the performance of CLIP with 33% of training time. Code is available at https://github.com/megvii-research/protoclip.
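As a rough illustration of the InfoNCE objective mentioned in the abstract, the following NumPy sketch computes a symmetric image-text contrastive loss over a batch. It is not the authors' implementation: the function name `info_nce`, the helper `log_softmax`, and the default `temperature=0.07` are assumptions made for this example.

```python
import numpy as np


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each array is one image-text pair.

    The temperature default is an assumption for this sketch.
    """
    # L2-normalise so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix: diagonal entries are the positive pairs,
    # every off-diagonal entry is a negative pair.
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])

    # Image-to-text: each row should pick out its own caption.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should pick out its own image.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Note that nothing in this loss explicitly asks semantically similar images (or texts) to sit near each other; any grouping arises only as a side effect of the pairwise terms, which is the effect the abstract describes.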

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to solve two main problems in Contrastive Language Image Pretraining (CLIP):

1. **Low efficiency in representation grouping**:
   - CLIP is pre-trained with the InfoNCE objective, which aligns representations by pulling positive pairs closer and pushing negative pairs apart. Alignment alone does not guarantee good downstream performance: even if image and text representations are well aligned, downstream tasks will still perform poorly when the representations are otherwise distributed at random instead of being grouped by semantics.
   - The authors found that the InfoNCE objective does indirectly group semantically similar representations during training, but only through randomly occurring within-modal anchors, which makes the grouping inefficient and unstable.

2. **The modality gap problem**:
   - The modality gap is the difference between the image and text representation spaces. When this gap is large, the InfoNCE objective mainly focuses on aligning the two spaces rather than on learning meaningful representations through anchor grouping. The gap is large at the beginning of training, because the two encoders are initialized independently and because of the "cone effect" of non-linear neural networks, and this hurts the effectiveness of representation learning.

### Solutions

To solve the above problems, the authors propose ProtoCLIP (Prototypical Contrastive Language Image Pretraining), which includes the following innovations:

1. **Prototype-level discrimination**:
   - ProtoCLIP lifts instance-level discrimination to prototype-level discrimination by constructing and dynamically updating prototypes in the image and text representation spaces. Each prototype represents a group of semantically similar instances. Prototypes from one modality directly supervise the learning of the other modality, which makes representation grouping more efficient and stable.

2. **Prototypical Back Translation (PBT)**:
   - PBT decouples representation grouping from representation alignment. It computes the centroids of the samples assigned to the same prototype in the student space and uses these centroids in place of the original prototypes when computing the prototype loss. Student representations are thus grouped toward centroids within their own modality instead of being forced toward the cross-modal prototype positions, which addresses the modality gap problem.

3. **Online episodic training strategy**:
   - To increase the frequency of cluster updates, ProtoCLIP is trained with an online episodic strategy, which allows it to scale to unlimited amounts of data.

4. **Probabilistic soft labels**:
   - Instead of traditional hard labels, ProtoCLIP uses probabilistic soft labels to convey structural knowledge. The soft labels are derived from the similarity between prototypes, so they better reflect the semantic relationships between samples and improve the model's representation learning.

### Experimental results

- On the Conceptual Captions dataset, ProtoCLIP improves over CLIP by 5.81% on ImageNet linear probing and by 2.01% on ImageNet zero-shot classification.
- On the larger YFCC-15M dataset, ProtoCLIP matches CLIP's performance with only 33% of the training time.

Through these improvements, ProtoCLIP achieves more efficient and stable representation learning in large-scale vision-language pre-training.
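The back-translation and soft-label ideas summarized above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' code: the function names `back_translate` and `prototype_loss`, the `temperature=0.1` default, and the exact way soft labels are built (a softmax over similarities between the assigned teacher prototype and all teacher prototypes) are choices made for this example. Cluster assignments are assumed to be given, for instance from a clustering step in the teacher modality.

```python
import numpy as np


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def log_softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def back_translate(student_emb, assignments, num_prototypes):
    """Centroid of the student embeddings assigned to each teacher prototype.

    Returns the (K, D) centroids and a boolean mask of non-empty prototypes.
    """
    centroids = np.zeros((num_prototypes, student_emb.shape[1]))
    # Unbuffered scatter-add: sum the embeddings of each cluster.
    np.add.at(centroids, assignments, student_emb)
    counts = np.bincount(assignments, minlength=num_prototypes)
    nonempty = counts > 0
    centroids[nonempty] /= counts[nonempty, None]
    return centroids, nonempty


def prototype_loss(student_emb, teacher_prototypes, assignments, temperature=0.1):
    """Cross-entropy between soft labels and student-to-centroid similarities."""
    student_emb = l2_normalize(student_emb)
    teacher_prototypes = l2_normalize(teacher_prototypes)
    num_prototypes = teacher_prototypes.shape[0]

    centroids, nonempty = back_translate(student_emb, assignments, num_prototypes)
    # Drop prototypes with no members in this batch and re-index assignments.
    centroids = l2_normalize(centroids[nonempty])
    protos = teacher_prototypes[nonempty]
    targets = (np.cumsum(nonempty) - 1)[assignments]

    # Soft labels (assumed form): similarity of the assigned teacher
    # prototype to every teacher prototype, turned into a distribution.
    soft_labels = softmax(protos[targets] @ protos.T / temperature, axis=1)

    # The student is scored against centroids in its OWN space, so it is
    # grouped within its modality rather than pulled across the gap.
    log_probs = log_softmax(student_emb @ centroids.T / temperature, axis=1)
    return -(soft_labels * log_probs).sum(axis=1).mean()
```

The key design point in this sketch is that the teacher contributes only the cluster structure (assignments and inter-prototype similarities), while the positions the student is pulled toward are computed inside the student space.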

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

Non-Contrastive Learning Meets Language-Image Pre-Training

iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

Contrastive Localized Language-Image Pre-Training

Improving CLIP Training with Language Rewrites

CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance

Modeling Caption Diversity in Contrastive Vision-Language Pretraining

Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning

ComKD-CLIP: Comprehensive Knowledge Distillation for Contrastive Language-Image Pre-traning Model

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation