Distilling BERT for low complexity network training

Bansidhar Mangalwedhekar
DOI: https://doi.org/10.48550/arXiv.2105.06514
2021-05-14
Abstract: This paper studies how efficiently knowledge learned by BERT can be transferred to low-complexity models such as BiLSTM, BiLSTM with attention, and shallow CNNs, using sentiment analysis on the SST-2 dataset. It also compares the inference complexity of BERT with that of these lower-complexity models, and underlines the importance of these techniques for running high-performance NLP models on edge devices such as mobile phones, tablets, and MCU development boards like the Raspberry Pi, enabling exciting new applications.
Computation and Language, Machine Learning