Abstract: Contrastive Language-Image Pre-training (CLIP) has attracted a surge of attention for its superior zero-shot performance and excellent transferability to downstream tasks. However, training such large-scale models usually requires substantial computation and storage, which poses barriers for general users with consumer-level computers. Motivated by this observation, in this paper we investigate how to achieve competitive performance on only one Nvidia RTX3090 GPU and with one terabyte for storing the dataset. On one hand, we simplify the transformer block structure and combine Weight Inheritance with multi-stage Knowledge Distillation (WIKD), thereby reducing the parameters and improving the inference speed during training along with deployment. On the other hand, confronted with the convergence challenge posed by a small dataset, we generate synthetic captions for each sample as data augmentation, and devise a novel Pair Matching (PM) loss to fully exploit the distinction between positive and negative image-text pairs. Extensive experiments demonstrate that our model can achieve a new state-of-the-art datascale-parameter-accuracy tradeoff, which could further popularize the CLIP model in the related research community.

What problem does this paper attempt to address?

The main problem this paper addresses is how to train and deploy large-scale Contrastive Language-Image Pre-training (CLIP) models on consumer-grade computers despite limited computing resources and storage space, while still achieving performance comparable to existing large-scale models. Specifically, the paper focuses on the following aspects:

1. **Limitations of computing and storage resources**:
   - Training CLIP models usually requires a large amount of computing resources and storage space. For example, MobileCLIP [33] is trained using 256 A100 GPUs, and the dataset requires 140 TB of local storage space.
   - The GPU memory of consumer-grade computers is usually no more than 24 GB (such as the Nvidia RTX3090), and the storage capacity may be less than 1 TB. This makes it very difficult to train CLIP models on these devices.
2. **Number of parameters and inference speed**:
   - Large-scale models (such as CLIP-B/16 [28]) have a large number of parameters (86.2M for the image encoder and 63.4M for the text encoder), which increases inference latency and makes deployment on devices with limited computing resources difficult.
3. **Challenges of small-scale datasets**:
   - When trained on small-scale datasets, CLIP models are prone to convergence problems. Existing datasets (such as CC12M [1]) are not only small in scale but also have low-quality labels, which further increases the difficulty of training.

To solve these problems, the paper proposes the following methods:

- **Simplifying the model structure**: By introducing SAS-P blocks (Shaped Attention Sub-block Parallel) and applying a weight-sharing strategy, the number of model parameters is reduced and the inference speed is increased.
- **Weight Inheritance and multi-stage Knowledge Distillation (WIKD)**: The model inherits the weights of a pre-trained model, and its performance is further improved through multi-stage knowledge distillation.
- **Pair Matching (PM) loss function**: A new loss function is proposed to distinguish positive and negative image-text pairs, improving training on small-scale datasets.
- **Enhancing the dataset**: Multiple synthetic captions are added to each image in the CC12M dataset, producing a new dataset, CC12M-SYN, with improved data diversity and quality.

Through these methods, the paper shows that with only one RTX3090 GPU and 1 TB of storage, a high-performance lightweight CLIP model can be trained that achieves competitive results on multiple downstream tasks.

Simplifying CLIP: Unleashing the Power of Large-Scale Models on Consumer-level Computers

Improving CLIP Training with Language Rewrites

Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies

FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

ComKD-CLIP: Comprehensive Knowledge Distillation for Contrastive Language-Image Pre-traning Model

CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

EVA-CLIP: Improved Training Techniques for CLIP at Scale

CLIP with Quality Captions: A Strong Pretraining for Vision Tasks

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training

CLIPPO: Image-and-Language Understanding from Pixels Only

Long-CLIP: Unlocking the Long-Text Capability of CLIP

TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval

CLIP-KD: An Empirical Study of CLIP Model Distillation

How Much Can CLIP Benefit Vision-and-Language Tasks?