Compressed-Transformer

Yuan, Chen; Pan, Rong
DOI: https://doi.org/10.1145/3443279.3443302
2020-01-01
Abstract: Recently, the Transformer has achieved state-of-the-art performance in neural machine translation. However, it has so many parameters that it must be compressed before it can be deployed and executed on resource-restricted devices. In this paper, we propose a compressed version of the Transformer called Compressed-Transformer. We introduce two techniques, parameter factorization and block reduction, to compress the Transformer model; together they reduce the number of parameters by more than 50%. To transfer knowledge from the base Transformer (teacher) to Compressed-Transformer (student), we use a stage-wise knowledge distillation strategy in which the temperature is adjusted dynamically. We evaluate on a Chinese-to-English (Zh~En) dataset from the United Nations Parallel Corpus and a German-to-English (De~En) dataset from Multi30K. The experimental results show that our compressed model achieves a BLEU score only slightly lower than that of the uncompressed teacher model. Specifically, on the Zh~En dataset, when the number of parameters is reduced by 59.3%, the student model achieves a BLEU score of 40.69, only 1.64 lower than that of the teacher model, and inference speed improves by 17%. The experiments on the De~En dataset yield similar results.
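The abstract names parameter factorization and temperature-based distillation without giving their details, so the following is only a minimal sketch of the general ideas, not the authors' implementation. It assumes a plain low-rank factorization (one d_in x d_out matrix replaced by two thinner matrices) and a standard soft-target cross-entropy loss. The function names, the 512 x 512 layer size, and the rank of 64 are all hypothetical choices for illustration. The paper's stage-wise schedule for adjusting the temperature is not reproduced here.

```python
import math


def factorized_param_count(d_in, d_out, rank):
    """Parameter count when a d_in x d_out matrix is replaced by
    a (d_in x rank) matrix times a (rank x d_out) matrix."""
    return d_in * rank + rank * d_out


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; higher values flatten the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature):
    """Cross-entropy between the temperature-softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


# Illustrative sizes only (assumed, not taken from the paper).
full = 512 * 512
low_rank = factorized_param_count(512, 512, 64)
reduction = 1 - low_rank / full

print(f"full: {full}, factorized: {low_rank}, reduction: {reduction:.1%}")
print("loss:", distillation_loss([0.5, 1.0, 1.5], [2.0, 1.0, 0.0], temperature=2.0))
```

In this sketch, a smaller rank removes more parameters at the cost of expressiveness, and a higher temperature exposes more of the teacher's probability mass on non-top tokens to the student.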