Abstract: Knowledge distillation, which transfers knowledge from a teacher model to a student model, has emerged as a powerful technique in neural machine translation for compressing models or simplifying training targets. It encompasses two primary methods: sentence-level distillation and token-level distillation. In sentence-level distillation, the student model is trained to match the output sequence generated by the teacher model, which eases training and gives the student model a comprehensive understanding of global structure. In contrast, token-level distillation requires the student model to learn the teacher model's output distribution at each token, enabling a more fine-grained transfer of knowledge. Prior studies have reported divergent performance for sentence-level and token-level distillation across different scenarios, which makes it unclear which method to choose in practice. In this study, we argue that token-level distillation, with its more complex objective (i.e., a distribution), is better suited to "simple" scenarios, while sentence-level distillation excels in "complex" scenarios. To substantiate this hypothesis, we systematically analyze the performance of both distillation methods while varying the size of the student model, the complexity of the text, and the difficulty of the decoding procedure. Although our experimental results support the hypothesis, defining the complexity level of a given scenario remains challenging. We therefore introduce a novel hybrid method that combines token-level and sentence-level distillation through a gating mechanism, aiming to leverage the advantages of both. Experiments show that the hybrid method outperforms both token-level and sentence-level distillation, as well as previous work, demonstrating its effectiveness.
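The two objectives and a gated combination of them can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not specify the form of the gate, so the convex combination below, the function names, and the use of NumPy are all assumptions. The token-level loss is written as the cross-entropy between teacher and student distributions, which differs from the KL divergence only by a term that does not depend on the student.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def token_level_loss(student_logits, teacher_logits):
    # Per-position cross-entropy against the teacher's full distribution.
    teacher_probs = np.exp(log_softmax(teacher_logits))
    return -(teacher_probs * log_softmax(student_logits)).sum(axis=-1)

def sentence_level_loss(student_logits, teacher_tokens):
    # Per-position negative log-likelihood of the teacher's decoded tokens.
    log_probs = log_softmax(student_logits)
    return -log_probs[np.arange(len(teacher_tokens)), teacher_tokens]

def hybrid_loss(student_logits, teacher_logits, teacher_tokens, gate):
    # ASSUMPTION: the gate is a weight in [0, 1] (a scalar or one value per
    # position) that mixes the two losses; the paper's gate may differ.
    tok = token_level_loss(student_logits, teacher_logits)
    sent = sentence_level_loss(student_logits, teacher_tokens)
    return float((gate * tok + (1.0 - gate) * sent).mean())

# Toy usage: 4 target positions, vocabulary of 6 tokens, random logits.
rng = np.random.default_rng(0)
student = rng.normal(size=(4, 6))
teacher = rng.normal(size=(4, 6))
# Stand-in for the teacher's decoded sequence (per-position argmax here;
# in practice this would come from the teacher's decoding procedure).
teacher_out = teacher.argmax(axis=-1)
print(hybrid_loss(student, teacher, teacher_out, gate=0.5))
```

With gate set to 1 the sketch reduces to pure token-level distillation, and with gate set to 0 to pure sentence-level distillation, so the gate interpolates between the two training signals.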

Sentence-Level or Token-Level? A Comprehensive Study on Knowledge Distillation