Abstract: Contrastive Language-Image Pre-training (CLIP) has been shown to improve zero-shot generalization capabilities of language and vision models. In this paper, we extend CLIP for efficient knowledge distillation by utilizing embeddings as teachers. Typical knowledge distillation frameworks require running forward passes through a teacher model, which is often prohibitive in the case of billion- or trillion-parameter teachers. In these cases, using only the embeddings of the teacher models to guide the distillation can yield significant computational savings. Our preliminary findings show that CLIP-based knowledge distillation with embeddings can outperform full-scale knowledge distillation using $9\times$ less memory and $8\times$ less training time. Code available at:
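The saving described in the abstract comes from how often the teacher is queried. The following is a minimal sketch of that accounting, not the authors' code: `CountingTeacher`, its toy `embed` method, and `precompute_embeddings` are hypothetical names introduced here for illustration, and the "teacher" is just a fixed transform that counts its own forward passes.

```python
from typing import List, Sequence

Vector = List[float]


class CountingTeacher:
    """Hypothetical stand-in for an expensive teacher; counts forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def embed(self, x: Sequence[float]) -> Vector:
        self.calls += 1
        # Toy "embedding": a fixed, deterministic transform of the input.
        return [2.0 * v for v in x]


def precompute_embeddings(teacher: CountingTeacher,
                          dataset: Sequence[Sequence[float]]) -> List[Vector]:
    """Run the teacher once per sample and cache the results."""
    return [teacher.embed(x) for x in dataset]


dataset = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
epochs = 4

# Conventional distillation: the teacher is queried for every sample in every epoch.
online_teacher = CountingTeacher()
for _ in range(epochs):
    for x in dataset:
        _target = online_teacher.embed(x)

# Embedding-based distillation: the teacher is queried once per sample up front;
# training then only reads from the cache, so the teacher need not stay in memory.
cached_teacher = CountingTeacher()
cache = precompute_embeddings(cached_teacher, dataset)
for _ in range(epochs):
    for i, _x in enumerate(dataset):
        _target = cache[i]

print(online_teacher.calls, cached_teacher.calls)  # 12 3
```

In this toy setting the online variant makes `epochs * len(dataset)` teacher calls, while the cached variant makes `len(dataset)` calls regardless of the number of epochs.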

What problem does this paper attempt to address?

The main problem this paper addresses is: **how to use pre-computed teacher model embeddings in knowledge distillation to improve computational efficiency while maintaining or improving the performance of the student model**. Traditional knowledge distillation methods run repeated forward passes through the teacher model to generate the outputs that guide the student model. This incurs a large computational overhead when the teacher model has an extremely large number of parameters. The paper proposes a new method, **CLIP-Embed-KD**, which trains the student model on pre-computed teacher embeddings, avoiding repeated teacher forward passes and substantially reducing the computational resources required.

### Main contributions of the paper:

1. **Introduction of the CLIP-Embed-KD framework**: Pre-computed teacher model embeddings serve as the guiding signal, replacing the dependence on a live teacher model in traditional knowledge distillation.
2. **Improvement of computational efficiency**: Compared with traditional knowledge distillation, CLIP-Embed-KD significantly reduces memory consumption and training time (the abstract reports $9\times$ less memory and $8\times$ less training time).
3. **Verification of the effectiveness of the method**: Experimental results show that CLIP-Embed-KD can achieve performance close to, or even better than, traditional knowledge distillation methods with fewer resources.

### Specific questions:

- **How can the contrastive learning objective of CLIP be used in knowledge distillation?**
- **How do CLIP-Embed-KD and CLIP-Teacher-KD differ in computational efficiency and accuracy?**

By investigating these questions, the authors aim to find a more efficient knowledge distillation method that achieves better model compression and performance transfer under limited computational resources.
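To make the first question concrete, here is a minimal, dependency-free sketch of a CLIP-style symmetric contrastive loss computed between student embeddings and cached teacher embeddings. It is an illustration under stated assumptions, not the paper's exact objective: the function names, the one-to-one pairing of student row `i` with teacher row `i`, and the temperature value of 0.07 are all choices made here for the example.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_style_distill_loss(student: Sequence[Vector],
                            teacher: Sequence[Vector],
                            temperature: float = 0.07) -> float:
    """Symmetric contrastive loss between student and cached teacher embeddings.

    Assumption: student[i] and teacher[i] describe the same sample, so the
    matching pairs lie on the diagonal of the similarity matrix.
    """
    s = [l2_normalize(v) for v in student]
    t = [l2_normalize(v) for v in teacher]
    # Cosine similarities scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
              for si in s]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the student-to-teacher and teacher-to-student directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Cached teacher embeddings (hypothetical values) and two candidate students.
teacher_cache = [[1.0, 0.0], [0.0, 1.0]]
aligned_student = [[1.0, 0.0], [0.0, 1.0]]
misaligned_student = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = clip_style_distill_loss(aligned_student, teacher_cache)
misaligned_loss = clip_style_distill_loss(misaligned_student, teacher_cache)
print(aligned_loss < misaligned_loss)  # True
```

A student whose embeddings match the cached teacher embeddings sample by sample gets a loss near zero, while a student that matches the wrong samples is penalized heavily. In a real training loop the student embeddings would come from a trainable network and the loss would be minimized with an autodiff framework; the teacher side would only ever be read from the cache.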

CLIP-Embed-KD: Computationally Efficient Knowledge Distillation Using Embeddings as Teachers

CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination

Enhancing CLIP Conceptual Embedding through Knowledge Distillation

ComKD-CLIP: Comprehensive Knowledge Distillation for Contrastive Language-Image Pre-traning Model

CLIP-KD: An Empirical Study of CLIP Model Distillation

DistilE: Distiling Knowledge Graph Embeddings for Faster and Cheaper Reasoning

Embedding Compression for Teacher-to-Student Knowledge Transfer

Linear Projections of Teacher Embeddings for Few-Class Distillation

Highlight Every Step: Knowledge Distillation via Collaborative Teaching

DistilCSE: Effective Knowledge Distillation For Contrastive Sentence Embeddings

Comparative Knowledge Distillation

An Embarrassingly Simple Approach for Knowledge Distillation

Improving Knowledge Distillation with Teacher's Explanation

ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval

Online Knowledge Distillation via Collaborative Learning

Simplified Knowledge Distillation for Deep Neural Networks Bridging the Performance Gap with a Novel Teacher–Student Architecture

ResKD: Residual-Guided Knowledge Distillation

Hybrid mix-up contrastive knowledge distillation

DCD: Discriminative and Consistent Representation Distillation

Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability

Learning to Project for Cross-Task Knowledge Distillation