Knowledge Distillation for Machine Translation

Zhen Li, Dan Qu, Chaojie Xie, Xuejuan Wei
DOI: https://doi.org/10.23977/csic.2018.0933
2018-01-01
Abstract: The encoder-decoder framework is a recent architecture for Neural Machine Translation (NMT). Convolutional Neural Network (CNN) models built on this framework have achieved significant success in NMT. Challenges remain in the practical use of CNN models: they require bilingual sentence pairs for training, and a CNN translation model must be retrained for each new bilingual dataset. Although some successes have been reported, avoiding the overfitting caused by the scarcity of parallel corpora remains an important research direction. This paper introduces a simple and efficient knowledge distillation method that acts as a regularizer against overfitting in CNN training by transferring the knowledge of a source model to an adapted model for low-resource languages in NMT. Experiments on an English-Czech dataset show that our model addresses the overfitting problem, generalizes better, and improves performance on a low-resource language translation task.
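The abstract does not state the exact training objective, so the following is only a minimal sketch of a common knowledge distillation loss (the temperature-softened formulation popularized by Hinton et al.), not the authors' implementation. It treats the source model as the teacher and the adapted model as the student for a single target-token prediction. The names `log_softmax`, `distillation_loss`, `temperature`, and `alpha` are illustrative assumptions and do not come from the paper.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits scaled by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]


def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft (teacher-matching) and a hard (label) term.

    Assumed formulation, not taken from the paper:
      soft term = T^2 * KL(teacher || student) at temperature T
      hard term = cross-entropy of the student against the reference token
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student_soft = log_softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student_soft))
    # Standard cross-entropy against the ground-truth target token.
    cross_entropy = -log_softmax(student_logits)[target_index]
    return alpha * (temperature ** 2) * kl + (1.0 - alpha) * cross_entropy


if __name__ == "__main__":
    # Toy vocabulary of three tokens; the reference token has index 2.
    teacher = [0.2, 1.0, 3.0]
    student = [1.5, 0.5, 1.0]
    print(distillation_loss(student, teacher, target_index=2))
```

In this sketch the soft term is what plays the regularizing role: with few parallel sentences, the student is pulled toward the teacher's full output distribution rather than fitting only the one-hot reference tokens. The weight `alpha` controls that trade-off, and setting it to 0 recovers ordinary cross-entropy training.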