Abstract: Knowledge distillation, which transfers knowledge from a teacher model to a student model, has emerged as a powerful technique in neural machine translation for compressing models or simplifying training targets. Knowledge distillation encompasses two primary methods: sentence-level distillation and token-level distillation. In sentence-level distillation, the student model is trained to match the output sequence of the teacher model, which eases training and gives the student model a comprehensive understanding of global structure. In contrast, token-level distillation requires the student model to learn the output distribution of the teacher model, enabling a more fine-grained transfer of knowledge. Studies have reported divergent performance between sentence-level and token-level distillation across different scenarios, leading to confusion about which method to choose in practice. In this study, we argue that token-level distillation, with its more complex objective (i.e., a distribution), is better suited to "simple" scenarios, while sentence-level distillation excels in "complex" scenarios. To substantiate this hypothesis, we systematically analyze the performance of the distillation methods while varying the size of the student model, the complexity of the text, and the difficulty of the decoding procedure. Although our experimental results validate the hypothesis, defining the complexity level of a given scenario remains challenging. We therefore introduce a novel hybrid method that combines token-level and sentence-level distillation through a gating mechanism, aiming to leverage the advantages of both. Experiments show that the hybrid method surpasses token-level distillation, sentence-level distillation, and previous work by a margin, demonstrating its effectiveness.
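The difference between the two objectives, and one way a gate might combine them, can be illustrated with a minimal sketch for a single target position. This is not the paper's implementation: the function names, the scalar `gate` in [0, 1], and the convex-combination form of the hybrid loss are all assumptions made for illustration, since the abstract does not specify how the gating mechanism is computed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_level_loss(student_logits, teacher_probs):
    """Cross-entropy between the teacher's full distribution and the student's.

    The student is asked to match every probability the teacher assigns,
    not only the teacher's top choice.
    """
    student_probs = softmax(student_logits)
    return -sum(
        q * math.log(p)
        for q, p in zip(teacher_probs, student_probs)
        if q > 0.0
    )


def sentence_level_loss(student_logits, teacher_token):
    """Negative log-likelihood of the token the teacher actually generated.

    Only the teacher's decoded output is used; its distribution is ignored.
    """
    return -math.log(softmax(student_logits)[teacher_token])


def hybrid_loss(student_logits, teacher_probs, teacher_token, gate):
    """Hypothetical gated mix: gate=1 is pure token-level, gate=0 pure sentence-level.

    Assumption: the gate is a scalar supplied by the caller; the paper's
    gate may be learned and computed differently.
    """
    token = token_level_loss(student_logits, teacher_probs)
    sentence = sentence_level_loss(student_logits, teacher_token)
    return gate * token + (1.0 - gate) * sentence


if __name__ == "__main__":
    student_logits = [2.0, 0.5, -1.0, 0.0]   # toy vocabulary of 4 tokens
    teacher_probs = [0.7, 0.2, 0.05, 0.05]   # teacher's soft distribution
    teacher_token = 0                        # token the teacher decoded
    for g in (0.0, 0.5, 1.0):
        print(g, hybrid_loss(student_logits, teacher_probs, teacher_token, g))
```

When the teacher's distribution is one-hot on the decoded token, the two losses coincide, which is one way to see sentence-level distillation as the simpler target of the two.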

Attentive Student Meets Multi-Task Teacher: Improved Knowledge Distillation for Pretrained Models

MLKD-BERT: Multi-level Knowledge Distillation for Pre-trained Language Models

Reinforced Multi-Teacher Selection for Knowledge Distillation

One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers

HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers

Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models

Selective Cross-Task Distillation

Multi-Teacher Distillation With Single Model for Neural Machine Translation

Knowledge Distillation Transfer Sets and their Impact on Downstream NLU Tasks

Layerwised multimodal knowledge distillation for vision-language pretrained model

Extract then Distill: Efficient and Effective Task-Agnostic BERT Distillation

Less is More: Task-aware Layer-wise Distillation for Language Model Compression

ALP-KD: Attention-Based Layer Projection for Knowledge Distillation

Heterogeneous Student Knowledge Distillation From BERT Using a Lightweight Ensemble Framework

Adaptive Multi-Teacher Multi-level Knowledge Distillation

Knowledge Fusion Distillation: Improving Distillation with Multi-scale Attention Mechanisms

Self-Improving Teacher Cultivates Better Student: Distillation Calibration for Multimodal Large Language Models

Knowledge Distillation Meets Self-Supervision

AMTSS: An Adaptive Multi-Teacher Single-Student Knowledge Distillation Framework For Multilingual Language Inference

Sentence-Level or Token-Level? A Comprehensive Study on Knowledge Distillation

Knowledge Distillation with the Reused Teacher Classifier