Accelerating Pretrained Language Model Inference Using Weighted Ensemble Self-distillation

Jun Kong,Jin Wang,Xuejie Zhang
DOI: https://doi.org/10.1007/978-3-030-88480-2_18
2021-01-01
Abstract:Pretrained language models (PLMs) have achieved remarkable results in various natural language processing tasks. As the performance of the model increases, it is also accompanied by more computational consumption and longer inference time, which makes deploying PLMs in edge devices for low-latency applications challenging. To address this issue, recent studies have recommended applying either model compression or early-exiting techniques to accelerate the inference. However, model compression permanently discards the modules of the model, leading to a decline in model performance. Train the PLMs backbone and the early-exiting classifier separately with early-exiting strategies. It not only brings extra training cost but also loses semantic information from higher layers, resulting in unreliable decisions of early-exiting classifiers. In this study, a weighted ensemble self-distillation method was proposed to improve the early-exiting strategy, which well balanced the performance and the inference time. It enables early-exiting classifiers to obtain rich semantic information from different layers with an attention mechanism according to the contribution of each layer to the final prediction. Furthermore, it simultaneously performs weighted ensemble self-distillation and fine-tuning of the PLMs backbone so that the PLMs can be fine-tuned in the training process of the early-exiting classifier to preserve the performance as much as possible. The experimental results show that the inference of the proposed model was accelerated at the minimum cost of performance loss, thus outperforming the previous early-exiting models. The code is available at: https://github.com/JunKong5/WestBERT.
What problem does this paper attempt to address?