Uncertainty-Driven Knowledge Distillation for Language Model Compression.

Tianyu Huang, Weisheng Dong, Fangfang Wu, Xin Li, Guangming Shi
DOI: https://doi.org/10.1109/taslp.2023.3289303
2023-01-01
IEEE/ACM Transactions on Audio Speech and Language Processing
Abstract: Despite the remarkable performance on various Natural Language Processing (NLP) tasks, the parametric complexity of pretrained language models has remained a major obstacle due to limited computational resources in many practical applications. Techniques such as knowledge distillation, network pruning, and quantization have been developed for language model compression. However, it has remained challenging to achieve an optimal tradeoff between model size and inference accuracy. To address this issue, we propose a novel and efficient uncertainty-driven knowledge distillation compression method for transformer-based pretrained language models. Specifically, we design a method of parameter retention and feedforward network parameter distillation to compress N-stacked transformer modules into one module in the fine-tuning stage. A key innovation of our approach is to add the uncertainty estimation module (UEM) into the student network such that it can guide the student network's feature reconstruction in the latent space (similar to the teacher's). Across multiple datasets in the natural language inference tasks of GLUE, we have achieved more than 95% accuracy of the original BERT, while only using about 50% of the parameters.
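The abstract does not give the loss used by the uncertainty estimation module, so the following is only a minimal sketch of one common way an uncertainty estimate can guide feature reconstruction: a heteroscedastic Gaussian negative log-likelihood, in which the student predicts a log-variance for each latent feature alongside the feature itself. The function name `uncertainty_weighted_loss` and its arguments are hypothetical and are not taken from the paper.

```python
import math

def uncertainty_weighted_loss(teacher_feat, student_feat, log_var):
    """Assumed formulation (not confirmed by the abstract):
    per-feature loss = 0.5 * exp(-s) * (t - f)^2 + 0.5 * s,
    where t is the teacher feature, f the student feature, and
    s the student's predicted log-variance for that feature.
    Returns the mean over all features.
    """
    if not (len(teacher_feat) == len(student_feat) == len(log_var)):
        raise ValueError("all inputs must have the same length")
    if not teacher_feat:
        raise ValueError("inputs must be non-empty")
    total = 0.0
    for t, f, s in zip(teacher_feat, student_feat, log_var):
        # A large predicted variance down-weights the squared error,
        # while the 0.5 * s term penalizes claiming high uncertainty everywhere.
        total += 0.5 * math.exp(-s) * (t - f) ** 2 + 0.5 * s
    return total / len(teacher_feat)

# Hypothetical teacher and student latent features.
teacher = [1.0, 2.0]
student = [1.0, 0.0]
confident = uncertainty_weighted_loss(teacher, student, [0.0, 0.0])
uncertain = uncertainty_weighted_loss(teacher, student, [0.0, math.log(4.0)])
```

With all log-variances at zero the loss reduces to half the mean squared error between teacher and student features. Raising the predicted variance on a hard-to-match feature lowers its contribution, which is the sense in which an uncertainty estimate can steer reconstruction in this sketch.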