Abstract: The complexity of deep neural networks (DNNs) severely limits their deployment on devices with limited computing and storage resources. Knowledge distillation (KD) is an attractive model compression technique that can effectively alleviate this problem. Multi-teacher knowledge distillation (MKD) aims to leverage the valuable and diverse knowledge distilled by multiple teacher networks to improve the performance of the student network. Existing approaches typically fuse the knowledge from multiple teachers with simple methods, such as averaging the prediction logits, or with sub-optimal weighting strategies. However, these techniques cannot fully reflect the relative importance of each teacher and may even mislead the student's learning. To address this issue, we propose a novel Decoupled Multi-Teacher Knowledge Distillation based on Entropy (DE-MKD). DE-MKD decouples the vanilla knowledge distillation loss and assigns each teacher an adaptive weight that reflects its importance, based on the entropy of its predictions. Furthermore, we extend the proposed approach to distill the intermediate features from multiple powerful but cumbersome teachers to improve the performance of the lightweight student network. Extensive experiments on the publicly available CIFAR-100 image classification benchmark with various teacher-student network pairs demonstrate the effectiveness and flexibility of our approach. For instance, the VGG8 and ShuffleNetV2 students trained with DE-MKD reached 75.25% and 78.86% top-1 accuracy with VGG13 and WRN40-2 as the respective teachers, setting new performance records. Surprisingly, the distilled student also outperformed its teacher in both teacher-student network pairs.
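
The idea of weighting teachers by the entropy of their predictions can be illustrated with a minimal sketch. The abstract does not state the weighting formula, so the rule used here (a softmax over negative entropies, giving more weight to more confident teachers) is an assumption for illustration only, as are all function names and the default temperature. The sketch also leaves out the decoupling of the distillation loss and the intermediate-feature distillation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def teacher_weights(teacher_logits, temperature=1.0):
    """Hypothetical weighting rule (not taken from the paper):
    weight_k is proportional to exp(-H_k), where H_k is the entropy
    of teacher k's softened prediction for this sample."""
    entropies = [entropy(softmax(l, temperature)) for l in teacher_logits]
    return softmax([-h for h in entropies])


def kl_divergence(p, q):
    """KL(p || q) for two probability vectors of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def weighted_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Entropy-weighted sum of per-teacher KD terms for one sample.
    The T^2 factor is the usual scaling in vanilla KD."""
    weights = teacher_weights(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    loss = 0.0
    for w, logits in zip(weights, teacher_logits):
        teacher_probs = softmax(logits, temperature)
        loss += w * kl_divergence(teacher_probs, student_probs)
    return (temperature ** 2) * loss


if __name__ == "__main__":
    teachers = [[5.0, 0.0, 0.0], [1.0, 0.9, 1.1]]  # confident vs. uncertain
    student = [0.0, 0.0, 0.0]
    print(teacher_weights(teachers))
    print(weighted_kd_loss(student, teachers))
```

Under this assumed rule, a teacher whose prediction is nearly uniform contributes less to the loss than one with a sharply peaked prediction, in contrast to plain logit averaging, where every teacher counts equally.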

Multi-Teacher Distillation With Single Model for Neural Machine Translation

Multilingual Neural Machine Translation with Knowledge Distillation

Reinforced Multi-Teacher Selection for Knowledge Distillation

One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers

AMTSS: An Adaptive Multi-Teacher Single-Student Knowledge Distillation Framework For Multilingual Language Inference

XtremeDistil: Multi-stage Distillation for Massive Multilingual Models

Adaptive Multi-Teacher Multi-level Knowledge Distillation

Life-long Learning for Multilingual Neural Machine Translation with Knowledge Distillation

Self-Distillation Mixup Training for Non-autoregressive Neural Machine Translation

Selective Knowledge Distillation for Neural Machine Translation

Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models

DE-MKD: Decoupled Multi-Teacher Knowledge Distillation Based on Entropy

Towards Understanding and Improving Knowledge Distillation for Neural Machine Translation

Selective Knowledge Distillation for Non-Autoregressive Neural Machine Translation

Adaptive Multi-Teacher Knowledge Distillation with Meta-Learning

Building a Multi-domain Neural Machine Translation Model using Knowledge Distillation

Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers

Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains

Unraveling Key Factors of Knowledge Distillation

Redistributing Low-Frequency Words: Making the Most of Monolingual Data in Non-Autoregressive Translation