Abstract: In the natural language processing (NLP) literature, neural networks are becoming increasingly deep and complex. Recent advances in neural NLP include large pretrained language models (e.g., BERT), which yield significant performance gains on various downstream tasks. Such models, however, require intensive computational resources to train and are difficult to deploy in practice because of poor inference-time efficiency. In this thesis, we address this problem through knowledge distillation (KD), in which a large pretrained model serves as the teacher and transfers its knowledge to a small student model. We also aim to demonstrate the competitiveness of small, shallow neural networks. We propose a simple yet effective approach that transfers the knowledge of a large pretrained network (namely, BERT) to a shallow neural architecture (namely, a bidirectional long short-term memory network, or BiLSTM). To facilitate this process, we propose heuristic data augmentation methods, so that the teacher model can better express its knowledge on the augmented corpus. Experimental results on various natural language understanding tasks show that our distilled model achieves performance comparable to the ELMo model (an LSTM-based pretrained model) on both single-sentence and sentence-pair tasks, while using roughly 60–100 times fewer parameters and 8–15 times less inference time. Although these experiments show that small BiLSTMs are more expressive on natural language tasks than previously thought, we wish to further exploit their capacity through a different KD framework. We propose MKD, a Multi-Task Knowledge Distillation approach. It distills the student model from different tasks jointly, so that the distilled model learns a more universal language representation by leveraging cross-task data. Furthermore, we evaluate our approach on two student architectures: a bi-attentive LSTM-based network and a three-layer Transformer. For the LSTM-based student, our approach retains the advantage in inference speed while maintaining performance comparable to KD methods designed specifically for Transformers. For the Transformer-based student, it provides a modest gain and outperforms other KD methods without using external training data.
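
The teacher-to-student transfer described in the abstract can be illustrated with a generic distillation objective. The abstract does not state which loss is used, so the following is only a minimal sketch under assumptions: it uses the common soft-target formulation (cross-entropy against temperature-softened teacher probabilities, blended with the usual hard-label loss). The function names and the `alpha` and `temperature` parameters are hypothetical and chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """Blend a hard-label loss with a soft-target loss from the teacher.

    Assumed formulation (not taken from the abstract):
      loss = alpha * CE(label, student) + (1 - alpha) * CE(teacher_T, student_T)
    where the _T distributions are softened with `temperature`.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft-target term: cross-entropy between teacher and student distributions.
    soft = -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    # Hard-label term: standard cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * hard + (1 - alpha) * soft


# Hypothetical logits for a two-class task.
teacher = [2.0, -1.0]
close_student = [2.0, -1.0]   # agrees with the teacher
far_student = [-1.0, 2.0]     # disagrees with the teacher
loss_close = distillation_loss(close_student, teacher, label=0)
loss_far = distillation_loss(far_student, teacher, label=0)
```

In this sketch a student whose logits match the teacher's receives a lower loss than one that disagrees, which is the behaviour a distillation objective is meant to reward. In a real setup the logits would come from the teacher (e.g., BERT) and the student (e.g., a BiLSTM) evaluated on the same, possibly augmented, input.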

Uncertainty-Driven Knowledge Distillation for Language Model Compression

Joint Structured Pruning and Dense Knowledge Distillation for Efficient Transformer Model Compression

AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression

Patient Knowledge Distillation for BERT Model Compression

Improving Knowledge Distillation for BERT Models: Loss Functions, Mapping Methods, and Weight Tuning

Joint Dual Feature Distillation and Gradient Progressive Pruning for BERT compression

Knowledge Distillation of Transformer-based Language Models Revisited

One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers

HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers

Knowledge Distillation with Source-free Unsupervised Domain Adaptation for BERT Model Compression

Towards Efficient Pre-Trained Language Model Via Feature Correlation Distillation

Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation

ERNIE-Tiny: A Progressive Distillation Framework for Pretrained Transformer Compression

MLKD-BERT: Multi-level Knowledge Distillation for Pre-trained Language Models

Knowledge Distillation with Reptile Meta-Learning for Pretrained Language Model Compression

Pea-KD: Parameter-efficient and Accurate Knowledge Distillation on BERT

Knowledge Distillation Application Technology for Chinese NLP

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

LAD: Layer-Wise Adaptive Distillation for BERT Model Compression

Model Compression with Two-stage Multi-teacher Knowledge Distillation for Web Question Answering System

Towards Effective Utilization of Pre-trained Language Models