Abstract: Contrastive Language-Image Pre-training (CLIP) has become a promising language-supervised visual pre-training framework. This paper aims to distill small CLIP models supervised by a large teacher CLIP model. We propose several distillation strategies, including relation, feature, gradient and contrastive paradigms, to examine the effectiveness of CLIP-Knowledge Distillation (KD). We show that a simple feature mimicry with Mean Squared Error loss works surprisingly well. Moreover, interactive contrastive learning across teacher and student encoders is also effective in performance improvement. We explain that the success of CLIP-KD can be attributed to maximizing the feature similarity between teacher and student. The unified method is applied to distill several student models trained on CC3M+12M. CLIP-KD improves student CLIP models consistently over zero-shot ImageNet classification and cross-modal retrieval benchmarks. When using ViT-L/14 pretrained on Laion-400M as the teacher, CLIP-KD achieves 57.5% and 55.4% zero-shot top-1 ImageNet accuracy over ViT-B/16 and ResNet-50, surpassing the original CLIP without KD by 20.5% and 20.1% margins, respectively. Our code is released at https://github.com/winycg/CLIP-KD.
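
The abstract's two claims, that MSE feature mimicry works well and that success comes from maximizing teacher-student feature similarity, are closely linked. The sketch below is illustrative only and is not taken from the released code: it assumes the embeddings are L2-normalized (the abstract does not say so), and all function names are hypothetical. Under that assumption, the squared error between a student and a teacher embedding equals 2 - 2·cos(student, teacher), so lowering the MSE raises the cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumption: embeddings are normalized)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def squared_error(a, b):
    """Summed squared difference between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def feature_mimicry_loss(student_feats, teacher_feats):
    """Mean squared error averaged over every element in the batch.

    Hypothetical helper: each argument is a list of embedding vectors,
    one per sample, with the student already projected to the teacher's
    embedding dimension.
    """
    total, count = 0.0, 0
    for s, t in zip(student_feats, teacher_feats):
        total += squared_error(s, t)
        count += len(s)
    return total / count


# Toy batch of two unit-length embeddings per model.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.6, 0.8], [0.0, 1.0]]

loss = feature_mimicry_loss(student, teacher)
# For unit vectors: ||s - t||^2 == 2 - 2 * cos(s, t)
lhs = squared_error(student[0], teacher[0])
rhs = 2 - 2 * cosine_similarity(student[0], teacher[0])
print(loss, lhs, rhs)
```

In this toy batch the second student embedding already matches the teacher, so only the first sample contributes to the loss.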

What problem does this paper attempt to address?

This paper addresses how to improve lightweight CLIP models, which are valuable in resource-constrained applications. Specifically, the authors propose CLIP-Knowledge Distillation (CLIP-KD), in which a large pre-trained teacher CLIP model supervises and enhances a small student CLIP model. The paper explores multiple distillation strategies, including relation distillation, feature distillation, gradient distillation, and contrastive learning, and evaluates how effective each is for CLIP-KD.

### Background of the paper

- **CLIP model**: CLIP (Contrastive Language-Image Pre-training) is pre-trained on image-text pairs with a contrastive learning framework, learning to predict which images and texts belong together. Pre-trained CLIP models perform well on zero-shot multi-modal and uni-modal visual tasks.
- **Existing work**: Some studies improve CLIP through additional visual self-supervised tasks or masked images. However, few studies explore how to improve lightweight CLIP models for resource-constrained applications.

### Goals of the paper

- **Propose CLIP-KD**: Guide and enhance a small student CLIP model using a large pre-trained teacher CLIP model.
- **Explore multiple distillation strategies**: Evaluate relation distillation (CRD), feature distillation (FD), gradient distillation (GD), and interactive contrastive learning (ICL) in the CLIP-KD setting.
- **Verify performance improvement**: Measure the gains of CLIP-KD on student models of different architectures using benchmarks such as zero-shot ImageNet classification and cross-modal retrieval.

### Main contributions

1. **Propose multiple distillation strategies**: These include relation distillation, feature distillation, gradient distillation, and interactive contrastive learning. Among them, simple feature distillation performs particularly well.
2. **Explain the reasons for the success of distillation**: A good CLIP distillation method maximizes the feature similarity between the teacher and student models, thereby narrowing the performance gap.
3. **Provide a comprehensive guide**: Unlike the existing TinyCLIP, CLIP-KD does not depend on a specific architecture. It applies to any combination of teacher and student architectures and shows better performance for both same-style and different-style architecture pairs.

### Experimental results

- **Feature distillation (FD)**: Feature distillation with a simple mean squared error (MSE) loss is the most effective strategy and significantly improves the performance of the student model.
- **Interactive contrastive learning (ICL)**: Letting the student perform joint contrastive learning with the teacher also yields a relatively good performance improvement.
- **Comprehensive method**: Combining multiple distillation strategies (such as FD + CRD + ICL) further improves the performance of the student model.

### Conclusion

CLIP-KD effectively improves the performance of lightweight CLIP models through multiple distillation strategies, especially on zero-shot ImageNet classification and cross-modal retrieval tasks. This provides an effective solution for resource-constrained applications.
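
As a rough illustration of interactive contrastive learning, the sketch below contrasts student embeddings against teacher embeddings with an InfoNCE-style loss. It is a minimal sketch under stated assumptions, not the paper's implementation: pairing student image embeddings with teacher text embeddings (and student text with teacher image), the temperature of 0.07, the equal weighting of the two directions, and every function name are assumptions made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, candidates, temperature=0.07):
    """Mean cross-entropy in which anchors[i] should match candidates[i].

    Every other candidate in the batch acts as a negative.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, c) / temperature for c in candidates]
        # Log-sum-exp with the max subtracted for numerical stability.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


def interactive_contrastive_loss(student_img, student_txt,
                                 teacher_img, teacher_txt,
                                 temperature=0.07):
    """Hypothetical cross-model contrastive loss (assumed pairing).

    Student image embeddings are contrasted with teacher text embeddings,
    and student text embeddings with teacher image embeddings.
    """
    image_to_text = info_nce(student_img, teacher_txt, temperature)
    text_to_image = info_nce(student_txt, teacher_img, temperature)
    return 0.5 * (image_to_text + text_to_image)


# Toy batch: the teacher's image and text embeddings are perfectly aligned.
teacher_img = [[1.0, 0.0], [0.0, 1.0]]
teacher_txt = [[1.0, 0.0], [0.0, 1.0]]

aligned = interactive_contrastive_loss(
    teacher_img, teacher_txt, teacher_img, teacher_txt)
# A student whose embeddings are swapped across samples is penalized.
swapped = interactive_contrastive_loss(
    teacher_img[::-1], teacher_txt[::-1], teacher_img, teacher_txt)
print(aligned, swapped)
```

A student that agrees with the teacher's pairings gets a loss near zero, while a student whose embeddings point at the wrong teacher samples gets a much larger loss. In a combined objective such as FD + CRD + ICL, a term like this would be added to the other losses with weights that this sketch does not attempt to specify.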

CLIP-KD: An Empirical Study of CLIP Model Distillation

ComKD-CLIP: Comprehensive Knowledge Distillation for Contrastive Language-Image Pre-traning Model

CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination

DCCD: Reducing Neural Network Redundancy Via Distillation

TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

ViTKD: Feature-based Knowledge Distillation for Vision Transformers

CLIP-Embed-KD: Computationally Efficient Knowledge Distillation Using Embeddings as Teachers

DCD: Discriminative and Consistent Representation Distillation

Online Knowledge Distillation Via Mutual Contrastive Learning for Visual Recognition

Hybrid mix-up contrastive knowledge distillation

Enhancing CLIP Conceptual Embedding through Knowledge Distillation

DistilCSE: Effective Knowledge Distillation For Contrastive Sentence Embeddings

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval

Dynamic Contrastive Distillation for Image-Text Retrieval

An Embarrassingly Simple Approach for Knowledge Distillation

Mclip: Multilingual CLIP Via Cross-lingual Transfer.

PromptKD: Unsupervised Prompt Distillation for Vision-Language Models

Rethinking Knowledge Distillation Via Cross-Entropy