Knowledge Distillation Application Technology for Chinese NLP

Hanwen Luo, Xiaodong Wang, Wei Yu, Chengyang Chang, Xiaoting Guo
DOI: https://doi.org/10.1109/icpeca51329.2021.9362719
2021-01-01
Abstract: Popular deep neural network models often suffer from high latency, difficult deployment, and demanding hardware requirements in practical applications. Knowledge distillation is an effective way to address these problems. We adopted an innovative knowledge distillation approach and designed task-specific data augmentation strategies, obtaining a lightweight model with a 6.7× acceleration ratio and a 13.6× compression ratio relative to the baseline model BERT-base; on average, the lightweight model reached 95% of BERT-base performance across tasks. We then investigated issues that remain in the knowledge distillation phase. To address the problems in distillation model selection and model fine-tuning, we propose a strategy for selecting the teacher and student models and a two-stage fine-tuning strategy applied before and after the knowledge distillation stage. These two strategies further raise the average performance of the models to 98% of BERT-base. Finally, we developed an evaluation scheme for lightweight models based on different types of downstream tasks, which provides a reference for practical applications that involve similar tasks.
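For readers unfamiliar with the technique, the following is a minimal sketch of the generic soft-target distillation objective, in which a student is trained to match a teacher's temperature-softened output distribution in addition to the true label. It is not the authors' method: the abstract does not specify their loss, and the function names, the `temperature` and `alpha` parameters, and their default values are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    Hypothetical parameters (not taken from the paper):
      temperature: softens both distributions before comparing them.
      alpha: weight on the soft-target (teacher-matching) term.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # Soft term: KL(teacher || student) on the softened distributions.
    soft = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )

    # Hard term: cross-entropy of the unsoftened student against the true label.
    hard = -math.log(softmax(student_logits)[label])

    # The T^2 factor keeps the soft-term gradient scale comparable across temperatures.
    return alpha * temperature ** 2 * soft + (1.0 - alpha) * hard
```

When the student's logits equal the teacher's, the soft term vanishes and only the hard-label term remains; a larger `alpha` puts more weight on imitating the teacher than on fitting the labels.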