Abstract: Contrastive Language Image Pretraining (CLIP) has received widespread attention because its learned representations transfer well to various downstream tasks. During CLIP training, the InfoNCE objective aligns positive image-text pairs and separates negative ones. We show an underlying representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerged within-modal anchors. Based on this understanding, we introduce Prototypical Contrastive Language Image Pretraining (ProtoCLIP) to enhance such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge. Further, Prototypical Back Translation (PBT) is proposed to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap. PBT also enables us to introduce additional external teachers with richer prior language knowledge. ProtoCLIP is trained with an online episodic training strategy, which allows it to scale to unlimited amounts of data. We train ProtoCLIP on Conceptual Captions and achieve a +5.81% ImageNet linear probing improvement and a +2.01% ImageNet zero-shot classification improvement. On the larger YFCC-15M dataset, ProtoCLIP matches the performance of CLIP with 33% of the training time. Code is available at https://github.com/megvii-research/protoclip.
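
For readers unfamiliar with the InfoNCE objective mentioned in the abstract, the following is a minimal sketch of a symmetric image-text contrastive loss in plain Python. It illustrates only the generic CLIP-style objective (matched pairs on the diagonal are pulled together, all other pairs in the batch are pushed apart); it is not ProtoCLIP's prototype-level loss or PBT. The function names and the temperature value of 0.07 are illustrative assumptions, not taken from the paper's code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    image_embs[i] and text_embs[i] form the positive pair; every other
    combination in the batch acts as a negative. The temperature is an
    assumed, illustrative value.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

Under these assumptions, a batch whose image and text embeddings are correctly paired yields a loss near zero, while a batch with swapped pairs yields a large loss.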

Improving Visual Counterfactual Explanation Models for Image Classification via CLIP

A Closer Look at the Explainability of Contrastive Language-Image Pre-training

Perceptual Image Quality Prediction: Are Contrastive Language–Image Pretraining (CLIP) Visual Features Effective?

Iclip: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

Making Heads or Tails: Towards Semantically Consistent Visual Counterfactuals

Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Investigating the Limitation of CLIP Models: The Worst-Performing Categories

CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks

CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

Improving CLIP Training with Language Rewrites

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

The Solution for Language-Enhanced Image New Category Discovery

ComKD-CLIP: Comprehensive Knowledge Distillation for Contrastive Language-Image Pre-traning Model