Abstract: Knowledge distillation, which transfers knowledge from a teacher model to a student model, has emerged as a powerful technique in neural machine translation for compressing models or simplifying training targets. It encompasses two primary methods: sentence-level distillation and token-level distillation. In sentence-level distillation, the student model is trained to match the output of the teacher model, which eases training and gives the student model a comprehensive understanding of global structure. In contrast, token-level distillation requires the student model to learn the output distribution of the teacher model, enabling a more fine-grained transfer of knowledge. Prior studies have reported divergent performance for sentence-level and token-level distillation across scenarios, leaving it unclear which method to choose in practice. In this study, we argue that token-level distillation, with its more complex objective (i.e., a distribution), is better suited to "simple" scenarios, while sentence-level distillation excels in "complex" scenarios. To substantiate this hypothesis, we systematically analyze the performance of both distillation methods while varying the size of the student model, the complexity of the text, and the difficulty of the decoding procedure. Although our experimental results validate the hypothesis, defining the complexity level of a given scenario remains challenging. We therefore introduce a novel hybrid method that combines token-level and sentence-level distillation through a gating mechanism, aiming to leverage the advantages of both. Experiments show that the hybrid method outperforms token-level distillation, sentence-level distillation, and previous methods by a margin, demonstrating its effectiveness.
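
To make the two objectives concrete, the following is a minimal sketch in plain Python, not the implementation used in the paper. It assumes per-position logits over a small vocabulary, represents the teacher's decoded output as a list of token ids, and stands in for the gating mechanism with a single scalar `gate` in [0, 1]; the abstract does not specify how the gate is computed, so this scalar and all function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_level_loss(student_logits, teacher_logits):
    """Token-level distillation: cross-entropy between the teacher's
    output distribution and the student's, averaged over positions."""
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_prob = [math.exp(v) for v in log_softmax(t_row)]
        total += -sum(p * lp for p, lp in zip(t_prob, s_logp))
    return total / len(student_logits)


def sentence_level_loss(student_logits, teacher_tokens):
    """Sentence-level distillation: negative log-likelihood of the
    teacher's decoded tokens under the student, averaged over positions."""
    total = 0.0
    for s_row, tok in zip(student_logits, teacher_tokens):
        total += -log_softmax(s_row)[tok]
    return total / len(student_logits)


def hybrid_loss(student_logits, teacher_logits, teacher_tokens, gate):
    """Assumed hybrid: a convex combination of the two losses.
    `gate` is a hypothetical scalar, not the paper's gating mechanism."""
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    tok = token_level_loss(student_logits, teacher_logits)
    sent = sentence_level_loss(student_logits, teacher_tokens)
    return gate * tok + (1.0 - gate) * sent
```

With `gate = 1.0` the sketch reduces to pure token-level distillation and with `gate = 0.0` to pure sentence-level distillation. In practice the teacher tokens would typically come from the teacher's decoding procedure (for example beam search), and a learned gate could vary per token or per sentence.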

Progressive distillation induces an implicit curriculum

Follow Your Path: A Progressive Method for Knowledge Distillation

PROD: Progressive Distillation for Dense Retrieval

Progressive Network Grafting for Few-Shot Knowledge Distillation

Progressive Ensemble Distillation: Building Ensembles for Efficient Inference

TC³KD: Knowledge distillation via teacher-student cooperative curriculum customization

Partial to Whole Knowledge Distillation: Progressive Distilling Decomposed Knowledge Boosts Student Better

Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion

Gap Preserving Distillation by Building Bidirectional Mappings with A Dynamic Teacher

Sentence-Level or Token-Level? A Comprehensive Study on Knowledge Distillation

Distilling Inductive Bias: Knowledge Distillation Beyond Model Compression

Education distillation: getting student models to learn in shcools

Understanding the Distillation Process from Deep Generative Models to Tractable Probabilistic Circuits

What Knowledge Gets Distilled in Knowledge Distillation?

Cooperative Knowledge Distillation: A Learner Agnostic Approach

Random Teachers are Good Teachers

Knowledge Distillation in Generations: More Tolerant Teachers Educate Better Students

Contrastive Representation Distillation

Continual Distillation Learning: An Empirical Study of Knowledge Distillation in Prompt-based Continual Learning

Lifelong Learning Via Progressive Distillation And Retrospection

On student-teacher deviations in distillation: does it pay to disobey?