Abstract: Knowledge Distillation (KD) is a powerful technique for transferring knowledge between neural network models, in which a pre-trained teacher model guides the training of a target student model. However, a suitable teacher model is not always available. To address this challenge, Self-Knowledge Distillation (SKD) constructs the teacher from the student model itself. Existing SKD methods either add Auxiliary Classifiers (AC) to intermediate layers of the model, or use historical versions of the model and model outputs for different input data within the same class. However, these methods are computationally expensive and capture only time-wise and class-wise features of the data. In this paper, we propose a lightweight SKD framework that utilizes multi-source information to construct a more informative teacher. Specifically, we introduce a Distillation with Reverse Guidance (DRG) method that considers different levels of information extracted by the model, including the edge, shape, and detail of the input data. Additionally, we design a Distillation with Shape-wise Regularization (DSR) method that ensures a consistent shape of the ranked model output across all data. We validate the performance of DRG, DSR, and their combination through comprehensive experiments on various datasets and models. Our results demonstrate the superiority of the proposed methods over baselines (by up to 2.87%) and over state-of-the-art SKD methods (by up to 1.15%), while remaining computationally efficient and robust. The code is available at https://github.com/xucong-parsifal/LightSKD.
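To make the teacher-student setup described in the abstract concrete, the following is a minimal sketch in plain Python. It is not the LightSKD implementation. `kd_loss` shows the commonly used temperature-softened distillation objective (cross-entropy on the hard label plus a KL term toward the teacher); the names `temperature` and `alpha` and their default values are assumptions made for illustration. `ranked_shape_penalty` is only one hypothetical reading of "a consistent shape of ranked model output": it sorts each sample's probabilities and penalizes deviation from the batch-average sorted vector. The abstract does not specify how DSR is actually defined, so this function should be treated as an assumption.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Generic KD objective for one sample (illustrative defaults).

    Combines cross-entropy on the hard label with a T^2-scaled
    KL(teacher || student) computed on temperature-softened outputs.
    """
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])

    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(q_teacher, q_student) if t > 0)

    return (1 - alpha) * ce + alpha * temperature ** 2 * kl


def ranked_shape_penalty(batch_probs):
    """Hypothetical shape-consistency term (an assumption, not the paper's DSR).

    Sorts each probability vector in descending order, then returns the mean
    squared deviation of each sorted vector from the batch-average sorted vector.
    """
    ranked = [sorted(p, reverse=True) for p in batch_probs]
    n, k = len(ranked), len(ranked[0])
    mean_shape = [sum(r[j] for r in ranked) / n for j in range(k)]
    return sum((r[j] - mean_shape[j]) ** 2 for r in ranked for j in range(k)) / (n * k)
```

In a self-distillation setting, `teacher_logits` would come from the model itself (for example, an earlier checkpoint or an auxiliary branch) rather than from a separately trained network. When teacher and student logits coincide, the KL term vanishes and only the weighted cross-entropy remains.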

Knowledge Distillation for Efficient Sequences of Training Runs

Practical Insights into Knowledge Distillation for Pre-Trained Models

Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability

Knowledge Distillation Performs Partial Variance Reduction

Comparative Knowledge Distillation

QEKD: Query-Efficient and Data-Free Knowledge Distillation from Black-box Models

An Embarrassingly Simple Approach for Knowledge Distillation

Knowledge Condensation Distillation

Lightweight Self-Knowledge Distillation with Multi-source Information Fusion

Dynamic Knowledge Distillation for Pre-trained Language Models

Data Efficient Stagewise Knowledge Distillation

Knowledge Representing: Efficient, Sparse Representation of Prior Knowledge for Knowledge Distillation

Knowledge Distillation Via Channel Correlation Structure

Densely Distilling Cumulative Knowledge for Continual Learning

Learning from a Lightweight Teacher for Efficient Knowledge Distillation

A Closer Look at Knowledge Distillation with Features, Logits, and Gradients

Elastic Knowledge Distillation by Learning from Recollection

Categories of Response-Based, Feature-Based, and Relation-Based Knowledge Distillation

Collaborative Knowledge Distillation Via Multiknowledge Transfer

Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models

Gradient Knowledge Distillation for Pre-trained Language Models