Multi-Teacher Distillation With Single Model for Neural Machine Translation

Xiaobo Liang, Lijun Wu, Juntao Li, Tao Qin, Min Zhang, Tie-Yan Liu
DOI: https://doi.org/10.1109/TASLP.2022.3153264
2022-01-01
Abstract: Knowledge distillation (KD) is an effective strategy for improving the performance of a student model in neural machine translation (NMT). Typically, a teacher guides the student by distilling its soft labels or data knowledge. However, data diversity and teacher knowledge are limited when only one teacher model is used. A natural solution is to adopt multiple randomized teacher models, but a major shortcoming is that the number of model parameters and the training cost grow substantially with the number of teachers. In this work, we explore mimicking multi-teacher distillation using the sub-network space and permuted variants of a single teacher model. Specifically, we train a teacher with multiple sub-network extraction paradigms: sub-layer reordering, layer-drop, and dropout variants. In doing so, one teacher model can provide multiple output variants while incurring neither additional parameters nor much extra training cost. Experiments on 8 IWSLT datasets (IWSLT14 En <-> De, En <-> Es and IWSLT17 En <-> Fr, En <-> Zh) and the large WMT14 En -> De translation task show that our method achieves performance nearly comparable to that of multiple teacher models with different randomized parameters, for both word-level and sequence-level knowledge distillation. Our code is available online at https://github.com/dropreg/RLD.
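To make the idea of drawing several output variants from one teacher more concrete, the following is a minimal toy sketch in plain Python. It covers only the dropout-variant case (sub-layer reordering and layer-drop are not shown) and is not the authors' implementation. The linear "teacher", the function names, and the choice to average the variant distributions into one soft label are all assumptions made for illustration; the abstract does not state how the variants are combined.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_logits(hidden, weights, drop_prob, rng):
    """One forward pass of a toy linear teacher (hypothetical).

    Inverted dropout is applied to `hidden`, so each call with a fresh
    mask behaves like a different sub-network of the same teacher.
    """
    keep = 1.0 - drop_prob
    masked = [h / keep if rng.random() < keep else 0.0 for h in hidden]
    return [sum(w * h for w, h in zip(row, masked)) for row in weights]


def multi_variant_soft_labels(hidden, weights, n_variants=4, drop_prob=0.3, seed=0):
    """Average the output distributions of several dropout variants.

    Averaging is an assumption of this sketch, not a detail from the paper.
    """
    rng = random.Random(seed)
    dists = [
        softmax(teacher_logits(hidden, weights, drop_prob, rng))
        for _ in range(n_variants)
    ]
    vocab = len(weights)
    return [sum(d[i] for d in dists) / n_variants for i in range(vocab)]


def word_level_kd_loss(student_logits, soft_labels):
    """Cross-entropy of the student distribution against teacher soft labels."""
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(soft_labels, probs))


# Toy example: 3 hidden units, vocabulary of 4 tokens (all values made up).
hidden = [0.5, -1.0, 2.0]
weights = [
    [0.2, 0.1, 0.4],
    [-0.3, 0.5, 0.1],
    [0.0, -0.2, 0.3],
    [0.6, 0.1, -0.1],
]
soft = multi_variant_soft_labels(hidden, weights)
loss = word_level_kd_loss([0.1, 0.0, 0.3, -0.2], soft)
```

Because every variant reuses the same `weights`, the sketch adds no parameters; the only extra cost is the repeated forward passes, which is the trade-off the abstract describes.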