Abstract: Contrastive Language Image Pretraining (CLIP) has received widespread attention because its learned representations transfer well to various downstream tasks. During the training of the CLIP model, the InfoNCE objective aligns positive image-text pairs and separates negative ones. We show an underlying representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerged within-modal anchors. Based on this understanding, we introduce Prototypical Contrastive Language Image Pretraining (ProtoCLIP) to enhance such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge. Further, Prototypical Back Translation (PBT) is proposed to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap. PBT also enables us to introduce additional external teachers with richer prior language knowledge. ProtoCLIP is trained with an online episodic training strategy, which allows it to be scaled up to unlimited amounts of data. We train ProtoCLIP on Conceptual Captions and achieve a +5.81% ImageNet linear probing improvement and a +2.01% ImageNet zero-shot classification improvement. On the larger YFCC-15M dataset, ProtoCLIP matches the performance of CLIP with 33% of the training time. Code is available at https://github.com/megvii-research/protoclip.
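
To make the instance-level objective mentioned in the abstract concrete, the following is a minimal pure-Python sketch of a symmetric InfoNCE loss over a batch of paired image and text embeddings. It is an illustration only, not the ProtoCLIP implementation: the function names (`info_nce`, `cross_entropy_rows`), the toy two-dimensional embeddings, and the temperature value of 0.07 are all assumptions made for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair (i, i) is positive, every (i, j != i) is negative.

    The temperature default is an assumed value chosen for illustration.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Similarity matrix: rows index images, columns index texts.
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))


# Toy batch: matched pairs point in similar directions, mismatched pairs do not.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]])
print(aligned < shuffled)
```

In this sketch the loss is small when each image is most similar to its own caption and large when the captions are swapped, which is the aligning and separating behavior the abstract attributes to InfoNCE. The prototype-level discrimination and PBT described above operate on top of this idea and are not shown here.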

Parrot Captions Teach CLIP to Spot Text

From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions

FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

ClipCap: CLIP Prefix for Image Captioning

Modeling Caption Diversity in Contrastive Vision-Language Pretraining

Fine-grained Image Captioning with CLIP Reward

Fine-tuning CLIP Text Encoders with Two-step Paraphrasing

S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Finetuning CLIP to Reason about Pairwise Differences

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness

CLIP with Quality Captions: A Strong Pretraining for Vision Tasks

Updating CLIP to Prefer Descriptions Over Captions

How Much Can CLIP Benefit Vision-and-Language Tasks?

A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

Improving CLIP Training with Language Rewrites