Abstract: Many recent breakthroughs in machine learning have been enabled by pre-trained foundation models. By scaling up model parameters, training data, and computation resources, foundation models have significantly advanced the state of the art in many applications. However, how to use these models to perform downstream tasks efficiently remains an open question. Knowledge distillation (KD), which transfers knowledge from a large teacher model to a smaller student model, has been explored to tackle this challenge. While KD has been successful in improving student model performance, recent research has found that a powerful teacher does not necessarily lead to a powerful student, because of the large capacity gap between the two. In addition, potential distribution shifts between the pre-training data and downstream tasks can make knowledge transfer in KD sub-optimal for improving downstream task performance. In this paper, we extend KD with an interactive communication process to help students of downstream tasks learn effectively from pre-trained foundation models. Our design is inspired by the way humans learn from teachers who can explain knowledge in a way that meets their students' needs. Specifically, we let each model (i.e., student and teacher) train two components: (1) an encoder that encodes the model's hidden states into a message and (2) a decoder that decodes any message into the model's own hidden states. With the encoder and decoder, not only can the teacher transfer rich information by encoding its hidden states, but the student can also send messages carrying information about downstream tasks to the teacher. As a result, the knowledge passed from teacher to student can be tailored to the student's capacity and to the downstream tasks' distributions. We conducted experiments on benchmark datasets to show that our communication mechanism outperforms state-of-the-art distillation techniques.
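The two-way exchange described in the abstract can be illustrated with a minimal sketch. It assumes linear encoders and decoders with random, untrained weights and a message space shared by both models. The class name `TalkingModel`, the helper functions, and all dimensions are hypothetical and are not taken from the paper, which does not specify its architecture or training objective in the abstract. The sketch shows only how messages let models with different hidden sizes exchange information in both directions.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return a random weight matrix with out_dim rows and in_dim columns."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, vec):
    """Multiply a weight matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class TalkingModel:
    """Hypothetical wrapper: one encoder and one decoder per model (assumption)."""

    def __init__(self, hidden_dim, message_dim, rng):
        self.hidden_dim = hidden_dim
        # Encoder: own hidden state -> message in the shared message space.
        self.encoder = make_linear(hidden_dim, message_dim, rng)
        # Decoder: any message -> this model's own hidden space.
        self.decoder = make_linear(message_dim, hidden_dim, rng)

    def encode(self, hidden):
        return apply_linear(self.encoder, hidden)

    def decode(self, message):
        return apply_linear(self.decoder, message)


rng = random.Random(0)
MESSAGE_DIM = 4  # assumed size of the shared message space

teacher = TalkingModel(hidden_dim=16, message_dim=MESSAGE_DIM, rng=rng)
student = TalkingModel(hidden_dim=6, message_dim=MESSAGE_DIM, rng=rng)

# Stand-ins for real hidden states (random values, for illustration only).
teacher_hidden = [rng.gauss(0.0, 1.0) for _ in range(teacher.hidden_dim)]
student_hidden = [rng.gauss(0.0, 1.0) for _ in range(student.hidden_dim)]

# Teacher -> student: the teacher encodes its hidden state into a message,
# and the student decodes that message into its own hidden space.
teacher_message = teacher.encode(teacher_hidden)
student_view = student.decode(teacher_message)

# Student -> teacher: the student sends information in the other direction.
student_message = student.encode(student_hidden)
teacher_view = teacher.decode(student_message)

print(len(teacher_message), len(student_view), len(teacher_view))
```

In an actual system the encoder and decoder weights would be learned rather than random. The point here is only that a shared message space allows either model to read what the other sends, regardless of their hidden sizes.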

Improved Knowledge Distillation for Pre-trained Language Models via Knowledge Selection

Gradient Knowledge Distillation for Pre-trained Language Models

MLKD-BERT: Multi-level Knowledge Distillation for Pre-trained Language Models

Improve Knowledge Distillation via Label Revision and Data Selection

Dynamic Knowledge Distillation for Pre-trained Language Models

Improved Knowledge Distillation via Adversarial Collaboration

Student-Oriented Teacher Knowledge Refinement for Knowledge Distillation

Multi-target Knowledge Distillation Via Student Self-reflection

Parameter-Efficient and Student-Friendly Knowledge Distillation

DDK: Distilling Domain Knowledge for Efficient Large Language Models

Efficient Knowledge Distillation: Empowering Small Language Models with Teacher Model Insights

Boosting Knowledge Distillation Via Intra-class Logit Distribution Smoothing

Knowledge Representing: Efficient, Sparse Representation of Prior Knowledge for Knowledge Distillation

Knowledge Distillation with a Precise Teacher and Prediction with Abstention

AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression

Collaborative Knowledge Distillation

Reinforced Multi-Teacher Selection for Knowledge Distillation

Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models

Talking Models: Distill Pre-trained Knowledge to Downstream Models via Interactive Communication

Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation

What Knowledge Gets Distilled in Knowledge Distillation?