Abstract: Contrastive Language Image Pretraining (CLIP) has received widespread attention because its learned representations transfer well to various downstream tasks. During CLIP training, the InfoNCE objective aligns positive image-text pairs and separates negative ones. We show an underlying representation grouping effect during this process: the InfoNCE objective indirectly groups semantically similar representations together via randomly emerging within-modal anchors. Based on this understanding, we introduce Prototypical Contrastive Language Image Pretraining (ProtoCLIP), which enhances such grouping by boosting its efficiency and increasing its robustness against the modality gap. Specifically, ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge. Further, Prototypical Back Translation (PBT) is proposed to decouple representation grouping from representation alignment, resulting in effective learning of meaningful representations under a large modality gap. PBT also enables us to introduce additional external teachers with richer prior language knowledge. ProtoCLIP is trained with an online episodic training strategy, which allows it to scale to unlimited amounts of data. We train ProtoCLIP on Conceptual Captions and achieve a +5.81% ImageNet linear probing improvement and a +2.01% ImageNet zero-shot classification improvement. On the larger YFCC-15M dataset, ProtoCLIP matches the performance of CLIP with 33% of the training time. Code is available at https://github.com/megvii-research/protoclip.
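For readers unfamiliar with the InfoNCE objective mentioned in the abstract, the following is a minimal sketch of a symmetric image-text InfoNCE loss in plain Python. It is not taken from the ProtoCLIP code base: the function names (`l2_normalize`, `cross_entropy_row`, `info_nce`) and the default temperature of 0.07 are illustrative assumptions, and the sketch covers only the standard instance-level loss, not the prototype-level discrimination or PBT.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def info_nce(image_emb, text_emb, temperature=0.07):
    # Hypothetical helper: image_emb[i] and text_emb[i] form the positive pair;
    # every other combination in the batch is treated as a negative.
    # The temperature value is an assumption, not a setting from the paper.
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    n = len(img)
    logits = [
        [sum(a * b for a, b in zip(img[i], txt[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: each column should peak on its diagonal entry.
    t2i = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (i2t + t2i)
```

Calling `info_nce` on a batch whose image and text embeddings are correctly paired yields a much lower loss than calling it on the same batch with the text embeddings shifted by one position, which is the "align positives, separate negatives" behavior the abstract describes.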

MoDE: CLIP Data Experts via Clustering

CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Demystifying CLIP Data

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

DiffCLIP: Few-shot Language-driven Multimodal Classifier

MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

MiCE: Mixture of Contrastive Experts for Unsupervised Image Clustering

CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning

CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination

What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights

MoEC: Mixture of Expert Clusters

TagCLIP: Improving Discrimination Ability of Zero-Shot Semantic Segmentation

EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling

Generalization Beyond Data Imbalance: A Controlled Study on CLIP for Transferable Insights

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts