Attentive Student Meets Multi-Task Teacher: Improved Knowledge Distillation for Pretrained Models.

Linqing Liu, Huan Wang, Jimmy Lin, Richard Socher, Caiming Xiong
2019-01-01
Abstract: Pretrained language models have led to significant performance gains in many NLP tasks. However, the intensive computing resources required to train such models remain an issue. Knowledge distillation alleviates this problem by learning a light-weight student model. So far, distillation approaches have all been task-specific. In this paper, we explore knowledge distillation under the multi-task learning setting. The student is jointly distilled across different tasks. It acquires more general representation capacity through multi-task distillation and can be further fine-tuned to improve the model in the target domain. Unlike other BERT distillation methods, which are specifically designed for Transformer-based architectures, we provide a general learning framework. Our approach is model-agnostic and can be easily applied to different future teacher model architectures. We evaluate our approach on Transformer-based and LSTM-based student models. Compared to a strong, similarly LSTM-based approach, we achieve better quality under the same computational constraints. Compared to the present state of the art, we reach comparable results with much faster inference speed.