Abstract: This paper studies pre-training for small models, which is essential for deployment on many mobile devices. Current state-of-the-art methods transfer the representational knowledge of a large network (the teacher) into a smaller model (the student) through self-supervised distillation, improving the small model's performance on downstream tasks. However, existing approaches do not sufficiently extract the knowledge that is crucial for discerning categories in downstream tasks. In this paper, for the first time, we introduce language guidance into the distillation process and propose a new method, Language-Guided Distillation (LGD), which uses the category names of the target downstream task to refine the knowledge transferred from the teacher to the student. To this end, we use a pre-trained text encoder to extract semantic embeddings from language and construct a textual semantic space called the Textual Semantics Bank (TSB). Furthermore, we design a Language-Guided Knowledge Aggregation (LGKA) module to construct the visual semantic space, called the Visual Semantics Bank (VSB). Task-related knowledge is transferred by driving the student encoder to mimic the similarity score distributions that the teacher infers over the TSB and VSB. Experimental results show that, compared with small models obtained by either ImageNet pre-training or self-supervised distillation, the lightweight model distilled with LGD achieves state-of-the-art performance on various downstream tasks, including classification, detection, and segmentation. The code is available at https://github.com/mZhenz/LGD.
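The transfer step described in the abstract, where the student mimics the teacher's similarity score distribution over a semantics bank, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity with a softmax, the cross-entropy objective, the shared temperature of 0.1, and the equal weighting of the two banks are all assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_distribution(feature, bank, temperature):
    """Softmax over cosine similarities between one feature and each bank entry."""
    f = l2_normalize(feature)
    logits = [
        sum(a * b for a, b in zip(f, l2_normalize(entry))) / temperature
        for entry in bank
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_feat, student_feat, bank, temperature=0.1):
    """Cross-entropy between teacher and student distributions over one bank.

    The temperature value is an assumption, not taken from the paper.
    """
    p = similarity_distribution(teacher_feat, bank, temperature)
    q = similarity_distribution(student_feat, bank, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def lgd_style_loss(teacher_feat, student_feat, text_bank, visual_bank):
    """Hypothetical combined objective: equal-weight sum over TSB and VSB."""
    return (distillation_loss(teacher_feat, student_feat, text_bank)
            + distillation_loss(teacher_feat, student_feat, visual_bank))

# Toy 2-D banks standing in for the TSB and VSB embeddings.
text_bank = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
visual_bank = [[0.9, 0.1], [0.1, 0.9], [-0.8, 0.2]]
teacher = [1.0, 0.2]
print(lgd_style_loss(teacher, [0.9, 0.3], text_bank, visual_bank))
```

With a shared temperature, the loss over each bank is smallest when the student's distribution matches the teacher's, which is the behaviour the abstract describes as mimicking the teacher.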

Dual-teacher Distillation Based on Interpretable Guidance for Lightening Mobile Model

Using Less but Important Information for Feature Distillation

Research on Knowledge Distillation Algorithm of Object Detection

DCCD: Reducing Neural Network Redundancy Via Distillation

Lightweight Self-Knowledge Distillation with Multi-source Information Fusion

Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation

Reinforced Multi-Teacher Selection for Knowledge Distillation

Knowledge Distillation with a Precise Teacher and Prediction with Abstention

Lightweight Model Pre-training via Language Guided Knowledge Distillation

DE-MKD: Decoupled Multi-Teacher Knowledge Distillation Based on Entropy

Dual teachers for self-knowledge distillation

Customizing a Teacher for Feature Distillation

Teaching What You Should Teach: A Data-Based Distillation Method

Improved Knowledge Distillation via Teacher Assistant

Beyond Self-Supervision: A Simple Yet Effective Network Distillation Alternative to Improve Backbones

Show, Attend and Distill: Knowledge Distillation via Attention-based Feature Matching

A Survey on Recent Teacher-student Learning Studies

Highlight Every Step: Knowledge Distillation via Collaborative Teaching

Improving Knowledge Distillation With a Customized Teacher

Channel Distillation: Channel-Wise Attention for Knowledge Distillation

Adaptive Multi-Teacher Multi-level Knowledge Distillation