Abstract: Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective in improving CLIP's performance than increasing its volume. Nevertheless, finding small subsets of training data that provably generalize the best has remained an open question. In this work, we propose the first theoretically rigorous data selection method for CLIP. We show that subsets that closely preserve the cross-covariance of the images and captions of the full data provably achieve a superior generalization performance. Our extensive experiments on ConceptualCaptions3M and ConceptualCaptions12M demonstrate that subsets found by our method achieve over 2.7x and 1.4x the accuracy of the next best baseline on ImageNet and its shifted versions. Moreover, we show that our subsets obtain 1.5x the average accuracy of the next best baseline across 11 downstream datasets. The code is available at:

What problem does this paper attempt to address?

This paper asks how to make Contrastive Language-Image Pretraining (CLIP) more data-efficient by focusing on data quality rather than quantity. CLIP achieves impressive zero-shot generalization by pretraining on large-scale image-caption datasets, but it needs a massive amount of pretraining data. Prior work has shown that improving the quality of that data is more effective than increasing its volume. What has remained open is how to find small subsets of the training data that provably generalize best.

The paper proposes what it describes as the first theoretically rigorous data selection method for CLIP. The method selects subsets whose image-caption cross-covariance closely matches that of the full data, and the authors show that such subsets provably achieve superior generalization performance. Experiments on Conceptual Captions 3M and 12M show that the selected subsets achieve over 2.7x and 1.4x the accuracy of the next best baseline on ImageNet and its shifted versions, and 1.5x the average accuracy of the next best baseline across 11 downstream datasets.

The study also points out that existing data selection techniques, such as gradient-based, loss-based, or prediction-entropy-based strategies, are not suitable for multimodal learning because of the properties of CLIP's contrastive loss. Instead, the paper builds on recent theoretical findings that the representations CLIP learns are determined by the cross-covariance matrix of the image-caption data, and uses this to guide subset selection. In short, the paper addresses the problem of selecting small subsets of a large image-text dataset that reduce the data requirement while maintaining the model's generalization capability.
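To make the cross-covariance criterion concrete, here is a minimal sketch in Python with NumPy. It is not the paper's selection algorithm, which this summary does not describe. It only measures how far a candidate subset's image-caption cross-covariance is from that of the full data, then picks the closest of several random candidates. The function names, the mean-centering, the Frobenius distance, the synthetic embeddings, and the random-candidate search are all assumptions made for illustration.

```python
import numpy as np


def cross_covariance(img, txt):
    """Centered cross-covariance of paired embeddings (centering is an assumption).

    img: (n, d_img) image embeddings, txt: (n, d_txt) caption embeddings.
    Returns a (d_img, d_txt) matrix.
    """
    img_c = img - img.mean(axis=0, keepdims=True)
    txt_c = txt - txt.mean(axis=0, keepdims=True)
    return img_c.T @ txt_c / img.shape[0]


def preservation_error(img, txt, idx):
    """Frobenius distance between full-data and subset cross-covariance."""
    full = cross_covariance(img, txt)
    sub = cross_covariance(img[idx], txt[idx])
    return float(np.linalg.norm(full - sub))


# Synthetic paired embeddings standing in for real image/caption features.
rng = np.random.default_rng(0)
n, d_img, d_txt, subset_size = 2000, 8, 6, 200
img = rng.normal(size=(n, d_img))
txt = img @ rng.normal(size=(d_img, d_txt)) + rng.normal(size=(n, d_txt))

# Hypothetical search: score 20 random subsets and keep the best-preserving one.
candidates = [rng.choice(n, size=subset_size, replace=False) for _ in range(20)]
errors = [preservation_error(img, txt, idx) for idx in candidates]
best_err = min(errors)
best_subset = candidates[int(np.argmin(errors))]

print(f"best error: {best_err:.4f}, worst error: {max(errors):.4f}")
```

Random search is used here only because it is short. A practical method would need a scalable way to search for low-error subsets, and the paper's guarantees apply to its own procedure rather than to this sketch.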

Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity

Modeling Caption Diversity in Contrastive Vision-Language Pretraining

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

CLIP with Quality Captions: A Strong Pretraining for Vision Tasks

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

Non-Contrastive Learning Meets Language-Image Pre-Training

Perceptual Image Quality Prediction: Are Contrastive Language–Image Pretraining (CLIP) Visual Features Effective?

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

Improving CLIP Training with Language Rewrites

Iclip: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

Demystifying CLIP Data

GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training

Robust Contrastive Language-Image Pre-training against Data Poisoning and Backdoor Attacks

A Hard-to-Beat Baseline for Training-free CLIP-based Adaptation

From Scarcity to Efficiency: Improving CLIP Training via Visual-enriched Captions

CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning

CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter.

ComKD-CLIP: Comprehensive Knowledge Distillation for Contrastive Language-Image Pre-traning Model

Better Safe than Sorry: Pre-training CLIP against Targeted Data Poisoning and Backdoor Attacks

A Closer Look at the Robustness of Contrastive Language-Image Pre-Training (CLIP)