Domain-specific knowledge distillation yields smaller and better models for conversational commerce

Kristen Howell, Jian Wang, Akshay Hazare, Joe Bradley, Chris Brew, Xi Chen, Matthew Dunn, Beth-Ann Hockey, Andrew Maurer, D. Widdows
DOI: https://doi.org/10.18653/v1/2022.ecnlp-1.18
2022-01-01
Abstract: We demonstrate that knowledge distillation can be used not only to reduce model size, but also to simultaneously adapt a contextual language model to a specific domain. We use Multilingual BERT (mBERT; Devlin et al., 2019) as a starting point and follow the knowledge distillation approach of Sanh et al. (2019) to train a smaller multilingual BERT model that is adapted to the domain at hand. We show that for in-domain tasks, the domain-specific model achieves an average improvement of 2.3% in F1 score relative to a model distilled on domain-general data. Whereas much previous work with BERT has fine-tuned the encoder weights during task training, we show that the model improvements from distillation on in-domain data persist even when the encoder weights are frozen during task training, allowing a single encoder to support classifiers for multiple tasks and languages.
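As a rough illustration of the soft-target idea behind knowledge distillation, the sketch below computes a temperature-scaled KL divergence between a teacher's and a student's output distributions. It is a minimal sketch, not the authors' training code: the function names, the temperature value, and the toy logits are all assumptions made for illustration, and a full distillation objective such as the one in Sanh et al. (2019) combines a term like this with additional losses.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    Hypothetical helper: the name and the default temperature are
    illustrative assumptions, not values taken from the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy logits (made up): a student that mimics the teacher gets a lower loss.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(loss_close < loss_far)
```

In the setting the abstract describes, the text used to compute such a loss would come from the target domain, so the student is pushed to match the teacher on in-domain inputs while also being smaller.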