Deep-to-Bottom Weights Decay: A Systemic Knowledge Review Learning Technique for Transformer Layers in Knowledge Distillation

Wang Ankun, Liu Feng, Huang Zhen, Hu Minghao, Li Dongsheng, Chen Yifan, Xie Xinjia
DOI: https://doi.org/10.1007/978-3-031-10986-7_11
2022-01-01
Abstract: The outstanding performance of pre-trained language models on natural language processing tasks comes at the cost of millions of parameters and heavy computational power consumption. Knowledge distillation is regarded as a compression strategy that addresses this problem. However, previous works have the following shortcomings: (i) they distill only some of the teacher model's transformer layers, which not only fails to make full use of the teacher-side information but also breaks the coherence of that information; (ii) they neglect the differences in difficulty of knowledge from deep to shallow layers, which correspond to different levels of information in the teacher model. In this paper, we introduce a deep-to-bottom weights decay review mechanism to knowledge distillation, which fuses teacher-side information while taking each layer’s difficulty level into consideration. To validate our claims, we distill a 12-layer BERT into a 6-layer model and evaluate it on the GLUE benchmark. Experimental results show that our review approach not only outperforms other existing techniques, but also outperforms the original model on some datasets.
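The abstract does not give the fusion formula, so the following is only a minimal sketch of what a deep-to-bottom decaying review could look like, not the authors' implementation. It assumes a geometric decay factor, a uniform mapping from 12 teacher layers to 6 student layers, and that each student layer reviews every teacher layer up to its mapped one; the names decay_weights, fuse_teacher_layers, and review_targets are hypothetical.

```python
def decay_weights(num_layers, decay=0.5):
    """Normalized weights that shrink geometrically from the deepest layer
    down to the bottom layer (the geometric form is an assumption)."""
    raw = [decay ** (num_layers - 1 - i) for i in range(num_layers)]
    total = sum(raw)
    return [w / total for w in raw]


def fuse_teacher_layers(teacher_hidden, upto, decay=0.5):
    """Weighted sum of the first `upto` teacher layers.

    teacher_hidden is a list of per-layer vectors (lists of floats),
    ordered from the bottom layer to the deepest layer.
    """
    weights = decay_weights(upto, decay)
    dim = len(teacher_hidden[0])
    return [
        sum(w * teacher_hidden[i][d] for i, w in enumerate(weights))
        for d in range(dim)
    ]


def review_targets(teacher_hidden, num_student_layers, decay=0.5):
    """One fused target per student layer, assuming a uniform layer mapping
    (student layer s reviews teacher layers 1..(s + 1) * step)."""
    step = len(teacher_hidden) // num_student_layers
    return [
        fuse_teacher_layers(teacher_hidden, (s + 1) * step, decay)
        for s in range(num_student_layers)
    ]


# Toy usage: 12 teacher layers with 2-dimensional hidden vectors.
teacher = [[float(layer)] * 2 for layer in range(1, 13)]
targets = review_targets(teacher, num_student_layers=6)
```

In a real setup the fused targets would be compared with the student's hidden states through a distillation loss such as mean squared error; the choice of loss here is likewise an assumption.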