Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity

Siddharth Joshi, Arnav Jain, Ali Payani, Baharan Mirzasoleiman
2024-03-20
Abstract: Contrastive Language-Image Pre-training (CLIP) on large-scale image-caption datasets learns representations that can achieve remarkable zero-shot generalization. However, such models require a massive amount of pre-training data. Improving the quality of the pre-training data has been shown to be much more effective in improving CLIP's performance than increasing its volume. Nevertheless, finding small subsets of training data that provably generalize the best has remained an open question. In this work, we propose the first theoretically rigorous data selection method for CLIP. We show that subsets that closely preserve the cross-covariance of the images and captions of the full data provably achieve a superior generalization performance. Our extensive experiments on ConceptualCaptions3M and ConceptualCaptions12M demonstrate that subsets found by our method achieve over 2.7x and 1.4x the accuracy of the next best baseline on ImageNet and its shifted versions. Moreover, we show that our subsets obtain 1.5x the average accuracy of the next best baseline across 11 downstream datasets. The code is available at:
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper studies how to make Contrastive Language-Image Pretraining (CLIP) more data-efficient by prioritizing data quality over quantity. CLIP achieves strong zero-shot generalization by pretraining on large-scale image-caption datasets, but it requires a massive amount of pretraining data. Improving the quality of the pretraining data has been shown to be more effective at improving CLIP's performance than increasing its volume. However, finding small subsets of the training data that provably generalize best has remained an open question.

The paper proposes what the authors describe as the first theoretically rigorous data selection method for CLIP. It builds on recent theoretical findings that CLIP's representations are determined by the cross-covariance matrix of the image-caption data, and it selects subsets whose image-caption cross-covariance closely matches that of the full dataset. The authors show that such subsets provably achieve superior generalization performance. The study also points out that existing data selection techniques, such as gradient-based, loss-based, or prediction-entropy-based strategies, are not suitable for multimodal learning because of the properties of CLIP's contrastive loss.

Experiments on Conceptual Captions 3M and 12M show that the selected subsets achieve over 2.7x and 1.4x the accuracy of the next best baseline on ImageNet and its shifted versions, and 1.5x the next best baseline's average accuracy across 11 downstream datasets. In short, the paper addresses the problem of selecting small subsets of large image-caption datasets that preserve CLIP's generalization, which reduces the amount of data needed for pretraining.
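To make the cross-covariance criterion more concrete, here is a minimal NumPy sketch. It is not the paper's selection algorithm; it only illustrates how one might measure how well a candidate subset preserves the image-caption cross-covariance of the full data. It assumes image and caption embeddings are already available as arrays, centers them with the full-data means, and uses the Frobenius norm as the distance. Those choices, and the function names `cross_covariance` and `preservation_error`, are assumptions made for illustration.

```python
import numpy as np


def cross_covariance(img, txt):
    """Cross-covariance of paired embeddings.

    img: (n, d_img) array, txt: (n, d_txt) array, both assumed centered.
    Returns a (d_img, d_txt) matrix.
    """
    return img.T @ txt / img.shape[0]


def preservation_error(img, txt, subset_idx):
    """Frobenius distance between full-data and subset cross-covariance.

    Assumption: both are centered with the full-data means, so the
    subset is compared in the same coordinate system as the full data.
    """
    img_c = img - img.mean(axis=0)
    txt_c = txt - txt.mean(axis=0)
    c_full = cross_covariance(img_c, txt_c)
    c_sub = cross_covariance(img_c[subset_idx], txt_c[subset_idx])
    return float(np.linalg.norm(c_full - c_sub, ord="fro"))


# Toy data: four image-caption pairs with 2-d embeddings (hypothetical values).
img = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
txt = img.copy()

# A subset covering both directions matches the full cross-covariance.
print(preservation_error(img, txt, [0, 2]))  # 0.0
# A subset covering only one direction does not (error is about 0.707).
print(preservation_error(img, txt, [0, 1]))
```

Under this measure, a lower error means the subset's cross-covariance is closer to that of the full data; a selection procedure in this spirit would prefer the first subset over the second.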