Joint Structured Pruning and Dense Knowledge Distillation for Efficient Transformer Model Compression

Baiyun Cui, Yingming Li, Zhongfei Zhang
DOI: https://doi.org/10.1016/j.neucom.2021.05.084
IF: 6
2021-01-01
Neurocomputing
Abstract: In this paper, we develop a novel Joint Model Compression (referred to as JMC) method that combines structured pruning and dense knowledge distillation to significantly compress an original large language model into a deeply compressed shallow network. In particular, a new Direct Importance-aware Structured Pruning (referred to as DISP) approach is proposed to prune redundant structures in Transformer networks directly, based on the corresponding parameter matrices in the model. In addition, a Dense Knowledge Distillation (referred to as DKD) method is developed with a many-to-one layer mapping strategy that leverages more comprehensive layer-wise linguistic knowledge for distillation. The proposed structured pruning and dense knowledge distillation are then integrated to perform joint compression, which achieves significant compression without sacrificing model accuracy. Extensive experimental results across four NLP tasks on seven datasets demonstrate the effectiveness of the method and its superiority to the baselines: it maintains performance similar to that of the original large model while providing remarkable benefits in inference-time speedup and memory efficiency.
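The abstract names two mechanisms, a many-to-one layer mapping for distillation and importance-based pruning driven by the parameter matrices, without giving their exact form. The sketch below is only a rough illustration of what such mechanisms could look like. Everything in it is an assumption rather than the paper's method: the contiguous grouping of teacher layers, the uniform averaging, the mean-squared-error loss, the use of an L2 norm as the importance score, and all function names are hypothetical.

```python
# Illustrative sketch only. The grouping rule, the averaging, the MSE loss,
# and the L2-norm importance score are assumptions, not the paper's DKD or DISP.


def many_to_one_mapping(num_teacher_layers, num_student_layers):
    """Assign a contiguous group of teacher layers to each student layer."""
    if num_teacher_layers % num_student_layers != 0:
        raise ValueError("this sketch assumes teacher layers split evenly")
    group = num_teacher_layers // num_student_layers
    return [list(range(s * group, (s + 1) * group))
            for s in range(num_student_layers)]


def dense_distillation_loss(teacher_hidden, student_hidden):
    """MSE between each student layer and the mean of its teacher group.

    Both arguments are lists of per-layer hidden vectors (lists of floats)
    that share the same hidden size.
    """
    mapping = many_to_one_mapping(len(teacher_hidden), len(student_hidden))
    total = 0.0
    for s_idx, t_indices in enumerate(mapping):
        student_vec = student_hidden[s_idx]
        for d in range(len(student_vec)):
            # Average the mapped teacher layers to form one target value.
            target = sum(teacher_hidden[t][d] for t in t_indices) / len(t_indices)
            total += (student_vec[d] - target) ** 2
    count = sum(len(vec) for vec in student_hidden)
    return total / count


def prune_heads_by_norm(head_weights, keep):
    """Return indices of the `keep` heads with the largest weight L2 norm."""
    norms = [sum(w * w for w in head) ** 0.5 for head in head_weights]
    ranked = sorted(range(len(head_weights)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:keep])
```

With a 12-layer teacher and a 4-layer student, `many_to_one_mapping(12, 4)` pairs each student layer with three consecutive teacher layers, so every teacher layer contributes to the distillation target instead of only a sampled subset.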