Abstract: Knowledge distillation (KD) is a promising technique for model compression in neural machine translation. However, it remains unclear where the knowledge in KD resides, which may hinder further development of KD. In this work, we first investigate this question empirically and show that the knowledge comes from the top-1 predictions of teachers, a finding that also suggests a potential connection between word- and sequence-level KD. Based on this finding, we further point out two inherent issues in vanilla word-level KD. First, the current KD objective spreads its focus over the whole distribution and gives no special treatment to the most crucial top-1 information. Second, the knowledge is largely covered by the ground-truth information, because most top-1 predictions of teachers overlap with ground-truth tokens, which further restricts the potential of KD. To address these issues, we propose a novel method named \textbf{T}op-1 \textbf{I}nformation \textbf{E}nhanced \textbf{K}nowledge \textbf{D}istillation (TIE-KD). Specifically, we design a hierarchical ranking loss to enforce the learning of the top-1 information from the teacher. Additionally, we develop an iterative KD procedure that infuses additional knowledge by distilling on data without ground-truth targets. Experiments on WMT'14 English-German, WMT'14 English-French, and WMT'16 English-Romanian demonstrate that our method boosts Transformer$_{base}$ students by +1.04, +0.60, and +1.11 BLEU, respectively, and significantly outperforms the vanilla word-level KD baseline. In addition, our method generalizes better across different teacher-student capacity gaps than existing KD techniques.
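
The two quantities the abstract reasons about can be illustrated with a minimal sketch. It assumes the common formulation of vanilla word-level KD as the per-token cross-entropy of the student distribution under the teacher distribution; the abstract itself does not state the formula. The function names (`word_level_kd_loss`, `top1_overlap`) and the toy logits are hypothetical, and the sketch does not implement the hierarchical ranking loss or the iterative procedure of TIE-KD.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def word_level_kd_loss(teacher_logits, student_logits):
    """Assumed vanilla word-level KD: mean over target positions of
    -sum_k p_teacher(k) * log q_student(k)."""
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = [math.exp(lp) for lp in log_softmax(t_row)]  # teacher probabilities
        log_q = log_softmax(s_row)                       # student log-probabilities
        total += -sum(pi * lqi for pi, lqi in zip(p, log_q))
    return total / len(teacher_logits)


def top1_overlap(teacher_logits, gold_ids):
    """Fraction of positions where the teacher's top-1 token is the ground-truth token."""
    hits = 0
    for row, gold in zip(teacher_logits, gold_ids):
        top1 = max(range(len(row)), key=row.__getitem__)  # argmax over the vocabulary
        hits += int(top1 == gold)
    return hits / len(gold_ids)


# Toy example: 3 target positions, vocabulary of 3 tokens (hypothetical values).
teacher = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 3.0]]
student = [[1.0, 0.9, 0.2], [0.3, 0.8, 0.4], [0.5, 0.1, 1.0]]
gold = [0, 1, 1]

kd = word_level_kd_loss(teacher, student)
overlap = top1_overlap(teacher, gold)
```

In this sketch the loss weights every vocabulary entry by its teacher probability, so the top-1 token receives no dedicated term. A high `top1_overlap` means the teacher's top-1 predictions largely coincide with the ground-truth tokens, which is the situation the abstract describes as restricting the additional knowledge KD can provide.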
