Adaptive Ensemble Self-Distillation with Consistent Gradients for Fast Inference of Pretrained Language Models

Jun Kong, Jin Wang, Xuejie Zhang
DOI: https://doi.org/10.1109/taslp.2023.3331080
2024-01-01
Abstract: Conditional computation algorithms, e.g., the early exiting (EE) strategy, can accelerate the inference of pretrained language models (PLMs) by exiting at shallow layers without computing the entire model. In addition to the adaptive inference of EE prediction for downstream tasks, self-distillation (SD) can encourage EE classifiers to mimic the behavior of the final classifier to enhance their representation capacity. However, the gradients from the different tasks of the EE classifiers can conflict and cancel one another out, so the parameters of the backbone and some EE classifiers are implicitly prevented from updating. Moreover, if the semantic gap between the final classifier and an EE classifier is significant, the EE classifier's performance decreases; that is, the final classifier may not be the best teacher for enhancing the EE classifier. This study proposes an early exiting strategy with adaptive ensemble self-distillation and consistent gradients for the inference acceleration of PLMs. To mitigate gradient conflicts, we orthogonally project the backpropagated gradient of the distillation loss onto the normal plane of the classification loss's gradient. Instead of directly using the final classifier as a single teacher for self-distillation, we dynamically assemble different adaptive teachers for different EE classifiers according to the learning abilities of the EE classifiers. An accumulative decision is drawn during adaptive inference to make accurate and reliable predictions. Experimental results show that the proposed model outperforms existing models at the same speed-up ratio and effectively balances model performance and inference time.
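
The gradient projection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes gradients are flattened into plain lists of floats, and that projection is applied only when the two gradients conflict (negative inner product), a PCGrad-style choice that the abstract does not confirm. The names project_onto_normal_plane, g_cls, g_distill and only_if_conflicting are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_onto_normal_plane(
    g_distill: Sequence[float],
    g_cls: Sequence[float],
    only_if_conflicting: bool = True,  # assumption, not stated in the abstract
    eps: float = 1e-12,
) -> List[float]:
    """Remove from g_distill its component along g_cls.

    The result is orthogonal to g_cls, i.e. it lies in the plane whose
    normal is the classification gradient.
    """
    inner = dot(g_distill, g_cls)
    norm_sq = dot(g_cls, g_cls)
    if norm_sq < eps:
        # A (near-)zero classification gradient defines no plane; leave as-is.
        return list(g_distill)
    if only_if_conflicting and inner >= 0.0:
        # The gradients do not conflict, so no projection is applied.
        return list(g_distill)
    scale = inner / norm_sq
    return [d - scale * c for d, c in zip(g_distill, g_cls)]


# Toy example: the distillation gradient partly opposes the classification one.
g_cls = [1.0, 0.0]
g_distill = [-1.0, 1.0]
g_proj = project_onto_normal_plane(g_distill, g_cls)

# Update direction that sums the classification gradient and the projected
# distillation gradient; the conflicting component has been removed.
combined = [c + p for c, p in zip(g_cls, g_proj)]
```

In this toy case the projected distillation gradient no longer cancels the classification gradient along the first coordinate, which is the effect the abstract attributes to consistent gradients. In a real training loop the same operation would be applied to the flattened gradients of the shared parameters before the optimizer step.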