SDSK2BERT: Explore the Specific Depth with Specific Knowledge to Compress BERT.

Lifang Ding, Yujiu Yang
DOI: https://doi.org/10.1109/icbk50248.2020.00066
2020-01-01
Abstract: The success of pretrained models such as BERT in Natural Language Processing (NLP) has created a demand for model compression. Previous works that use knowledge distillation (KD) to compress BERT fix the depth in advance, so they do not fully explore over-parameterization or answer what depth is appropriate for a specific dataset. In this work, we take two Natural Language Inference (NLI) datasets of different difficulty as examples to answer the question of how many layers are needed. During the exploration of depth, we use the learned dataset-specific weights to warm up the network in the next run, which helps the model find a better local optimum. With an accuracy drop of 1% to 2%, our method reduces the 12-layer BERT model to 6 layers on the MNLI-matched dataset and to 2 layers on the DNLI dataset, which not only reduces the parameters to 1/2 and 1/6 of the original, respectively, but also outperforms the general knowledge distillation framework by about 1% accuracy. We also use visualization to explain why and when our framework works.
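The abstract names two ingredients, knowledge distillation and warm-starting the next run from previously learned weights, without giving formulas. The sketch below illustrates both in plain Python under stated assumptions: the loss is the commonly used soft-target distillation loss rather than the paper's confirmed objective, and the rule of reusing the first layers of the previous run is a guess, since the abstract does not say which weights are carried over. All function and parameter names (`distillation_loss`, `warm_start`, `temperature`, `alpha`) are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Generic soft-target KD loss (an assumption, not the paper's stated loss).

    Combines hard-label cross-entropy with the KL divergence between the
    temperature-softened teacher and student distributions.
    """
    hard = -math.log(softmax(student_logits)[label])
    t_soft = softmax(teacher_logits, temperature)
    s_soft = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s) for t, s in zip(t_soft, s_soft) if t > 0)
    # The temperature**2 factor keeps the soft-term gradient scale comparable.
    return alpha * hard + (1 - alpha) * temperature ** 2 * soft


def warm_start(previous_layers, new_depth):
    """Initialise a shallower model from the previous run's weights.

    Assumed rule: copy the first `new_depth` layers. The abstract does not
    specify which layers are reused.
    """
    if not 0 < new_depth <= len(previous_layers):
        raise ValueError("new_depth must be between 1 and the previous depth")
    return [list(layer) for layer in previous_layers[:new_depth]]
```

In a depth search of this kind, one would train at a given depth with `distillation_loss`, then call `warm_start` to initialise the next, shallower candidate before training it again.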