Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies

Zichao Li, Cihang Xie, Ekin Dogus Cubuk
2024-04-16
Abstract: This paper investigates the performance of Contrastive Language-Image Pre-training (CLIP) when scaled down to limited computation budgets. We explore CLIP along three dimensions: data, architecture, and training strategies. With regard to data, we demonstrate the significance of high-quality training data and show that a smaller dataset of high-quality data can outperform a larger dataset of lower quality. We also examine how model performance varies with dataset size, suggesting that smaller ViT models are better suited for smaller datasets, while larger models perform better on larger datasets with fixed compute. Additionally, we provide guidance on when to choose a CNN-based architecture or a ViT-based architecture for CLIP training. We compare four CLIP training strategies - SLIP, FLIP, CLIP, and CLIP+Data Augmentation - and show that the choice of training strategy depends on the available compute resources. Our analysis reveals that CLIP+Data Augmentation can achieve comparable performance to CLIP using only half of the training data. This work provides practical insights into how to effectively train and deploy CLIP models, making them more accessible and affordable for practical use in various applications.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper asks how to get the best performance from Contrastive Language-Image Pre-training (CLIP) models under a limited compute budget. It analyzes scaled-down CLIP along three dimensions: data, architecture, and training strategy. First, on data, it stresses the importance of training-data quality, showing that a smaller, higher-quality dataset can outperform a larger, lower-quality one. Second, on architecture, it finds that the best model size depends on dataset size: with compute held fixed, smaller ViT models are better suited to smaller datasets, while larger models perform better on larger datasets. It also gives guidance on when to choose a CNN-based rather than a ViT-based architecture. Third, on training strategy, it compares SLIP, FLIP, CLIP, and CLIP+Data Augmentation, showing that the best choice depends on the available compute and that CLIP+Data Augmentation can match the performance of CLIP using only half of the training data. In summary, the paper addresses how to train and deploy CLIP models effectively under limited compute by choosing dataset quality, model architecture, and training strategy appropriately, making CLIP more accessible and affordable in practical applications.