Abstract: Contrastive Language Image Pretraining (CLIP) has received widespread attention because its learned representations transfer well to various downstream tasks. During CLIP training, the InfoNCE objective aligns positive image-text pairs and separates negative ones. We show an underlying representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerged within-modal anchors. Based on this understanding, we introduce Prototypical Contrastive Language Image Pretraining (ProtoCLIP) to enhance such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge. Further, Prototypical Back Translation (PBT) is proposed to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap. PBT also enables us to introduce additional external teachers with richer prior language knowledge. ProtoCLIP is trained with an online episodic training strategy, which allows it to scale to unlimited amounts of data. We train ProtoCLIP on Conceptual Captions and achieve a +5.81% ImageNet linear probing improvement and a +2.01% ImageNet zero-shot classification improvement. On the larger YFCC-15M dataset, ProtoCLIP matches the performance of CLIP with 33% of the training time. Code is available at https://github.com/megvii-research/protoclip.
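For readers unfamiliar with the InfoNCE objective mentioned in the abstract, the following is a minimal NumPy sketch of a symmetric image-text contrastive loss. It is an illustration only, not the ProtoCLIP code: the function names `info_nce` and `_log_softmax`, the default temperature of 0.07, and the assumption that row i of each array forms a positive pair are all choices made for this example.

```python
import numpy as np


def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    Assumption for this sketch: row i of image_emb and row i of text_emb
    are a positive pair; every other row in the batch acts as a negative.
    """
    # L2-normalise so that the dot product is a cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix of shape (batch, batch), scaled by temperature.
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])

    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -_log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: the same, computed over columns.
    loss_t2i = -_log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

In this sketch, a batch whose pairs are correctly matched yields a lower loss than the same batch with the text rows permuted, which is the "align positives, separate negatives" behaviour the abstract describes.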

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

How Much Can CLIP Benefit Vision-and-Language Tasks?

Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Alignment

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

SpaceCLIP: A Vision-Language Pretraining Framework With Spatial Reconstruction On Text

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Iclip: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

Image–Text Matching Model Based on CLIP Bimodal Encoding

The Solution for Language-Enhanced Image New Category Discovery

RankCLIP: Ranking-Consistent Language-Image Pretraining

Transferring Image-CLIP to Video-Text Retrieval via Temporal Relations

PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance

Enhancing Vision-Language Model with Unmasked Token Alignment