Abstract: Many recent breakthroughs in machine learning have been enabled by pre-trained foundation models. By scaling up model parameters, training data, and computational resources, foundation models have significantly advanced the state of the art in many applications. However, how to use these models efficiently for downstream tasks remains an open question. Knowledge distillation (KD), which transfers knowledge from a large teacher model to a smaller student model, has been explored to tackle this challenge. While KD has been successful in improving student model performance, recent research has found that a powerful teacher does not necessarily produce a powerful student, because of the large capacity gap between the two. In addition, potential distribution shifts between the pre-training data and the downstream tasks can make knowledge transfer in KD sub-optimal for improving downstream task performance. In this paper, we extend KD with an interactive communication process that helps students of downstream tasks learn effectively from pre-trained foundation models. Our design is inspired by the way humans learn from teachers who can explain knowledge in a way that meets the students' needs. Specifically, we let each model (i.e., student and teacher) train two components: (1) an encoder that encodes the model's hidden states into a message and (2) a decoder that decodes any message into the model's own hidden states. With the encoder and decoder, not only can the teacher transfer rich information by encoding its hidden states, but the student can also send messages carrying information about the downstream tasks to the teacher. As a result, the knowledge passed from teacher to student can be tailored to the student's capacity and to the downstream tasks' distributions. We conducted experiments on benchmark datasets to show that our communication mechanism outperforms state-of-the-art distillation techniques.
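The encoder/decoder exchange described in the abstract can be pictured with a small shape-level sketch. This is only an illustration under stated assumptions, not the paper's implementation: the class name `Communicator`, the linear maps, the random (untrained) weights, and the dimensions 16, 4, and 8 are all hypothetical choices made here, and no training objective is shown.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) weight matrix standing in for a trained layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def apply_linear(weights, vector):
    """Matrix-vector product: maps a length-in_dim vector to length out_dim."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

class Communicator:
    """One model's communication components (hypothetical name and design).

    encoder: the model's own hidden state -> message in a shared message space
    decoder: any message -> the model's own hidden-state space
    """

    def __init__(self, hidden_dim, message_dim, rng):
        self.encoder = make_linear(hidden_dim, message_dim, rng)
        self.decoder = make_linear(message_dim, hidden_dim, rng)

    def encode(self, hidden):
        return apply_linear(self.encoder, hidden)

    def decode(self, message):
        return apply_linear(self.decoder, message)

rng = random.Random(0)
TEACHER_DIM, STUDENT_DIM, MESSAGE_DIM = 16, 4, 8  # illustrative sizes only

teacher = Communicator(TEACHER_DIM, MESSAGE_DIM, rng)
student = Communicator(STUDENT_DIM, MESSAGE_DIM, rng)

# Placeholder hidden states; in practice these would come from the models.
teacher_hidden = [rng.gauss(0, 1) for _ in range(TEACHER_DIM)]
student_hidden = [rng.gauss(0, 1) for _ in range(STUDENT_DIM)]

# Teacher -> student: the teacher encodes its hidden state into a message,
# and the student decodes that message into its own (smaller) hidden space.
teacher_message = teacher.encode(teacher_hidden)
received_by_student = student.decode(teacher_message)

# Student -> teacher: the student sends a message about its downstream input,
# and the teacher decodes it into its own (larger) hidden space.
student_message = student.encode(student_hidden)
received_by_teacher = teacher.decode(student_message)
```

The point of the shared message dimension in this sketch is that models with different hidden sizes can still exchange information in both directions; how the encoders and decoders are actually trained is left to the paper itself.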

Talking Models: Distill Pre-trained Knowledge to Downstream Models via Interactive Communication