Distillation for Text Classification Task Based on BERT

Chuquan Sun, Xinning Li, Shiyu Ge, Zhiyong An, Caiming Zhang
DOI: https://doi.org/10.1109/aemcse51986.2021.00103
2021-01-01
Abstract: In recent years, the rapid development of the Internet and the surge in the volume of web text have made text classification technology increasingly important. However, BERT-based text classification has two problems: (1) the maximum input length of the model is 512, so information is lost if longer text is simply truncated; (2) the model is large and its inference time is long, which makes deployment on mobile terminals inconvenient. To address problem 1, text longer than 512 is first truncated. Considering that the end of a text usually contains more emotional information, the truncation strategy uses the 170th and 340th positions. In addition, a different text feature is taken from the model: the mean and the maximum are computed along the sequence-length dimension and concatenated into a column vector that serves as the input feature of the model. To address problem 2, knowledge distillation is applied to the model for the classification task, reducing the number of parameters and improving inference efficiency to ease practical deployment. Three groups of control experiments show that the improved BERT model reaches an overall classification accuracy of 97%, with more balanced and more robust overall performance, slightly better than the BERT text classification model. The distilled BERT model performs only 1.5% below the BERT model, yet it has 92.6% fewer parameters than the original model and its inference is nearly 4 times faster, which shows the effectiveness of both the improved input features and the model compression.
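The three techniques named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation, and it rests on several assumptions: the "170th and 340th" strategy is read here as keeping the first 170 and the last 340 tokens (510 tokens, leaving room for two special tokens within 512); the distillation loss is a standard soft-target formulation with a temperature and a mixing weight, which the abstract does not specify; and all function names and default values are hypothetical.

```python
import math

# Assumed reading of the abstract: keep 170 head tokens and 340 tail tokens.
# 170 + 340 = 510, which leaves room for two special tokens within 512.
HEAD, TAIL = 170, 340


def truncate_head_tail(tokens, head=HEAD, tail=TAIL):
    """Keep the first `head` and the last `tail` tokens of an over-long text."""
    if len(tokens) <= head + tail:
        return list(tokens)
    return list(tokens[:head]) + list(tokens[-tail:])


def mean_max_pool(hidden_states):
    """Pool token vectors along the sequence axis and concatenate mean and max.

    `hidden_states` is a list of seq_len vectors, each of length hidden_size.
    The result has length 2 * hidden_size.
    """
    seq_len = len(hidden_states)
    columns = list(zip(*hidden_states))  # one tuple per hidden dimension
    mean_vec = [sum(col) / seq_len for col in columns]
    max_vec = [max(col) for col in columns]
    return mean_vec + max_vec


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Soft-target distillation loss (an assumed, commonly used form).

    alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label)
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(p_teacher, p_student) if t > 0)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce
```

In this sketch, `truncate_head_tail` would run on the tokenized text before special tokens are added, `mean_max_pool` would run on the encoder's last-layer token vectors, and `distillation_loss` would be averaged over a batch when training the smaller student model against the fine-tuned teacher.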