Abstract: While pre-trained language models (e.g., BERT) have achieved impressive results on a range of natural language processing tasks, they have large numbers of parameters and incur high computational and memory costs, which makes them difficult to deploy in real-world settings. Model compression is therefore necessary to reduce the computation and memory cost of pre-trained models. In this work, we aim to compress BERT and address two challenging practical issues: (1) the compression algorithm should be able to output multiple compressed models with different sizes and latencies, in order to support devices with different memory and latency limitations; (2) the algorithm should be downstream-task agnostic, so that the compressed models are generally applicable to different downstream tasks. We leverage techniques from neural architecture search (NAS) and propose NAS-BERT, an efficient method for BERT compression. NAS-BERT trains a big supernet on a search space containing a variety of architectures and outputs multiple compressed models with adaptive sizes and latencies. Furthermore, NAS-BERT is trained on standard self-supervised pre-training tasks (e.g., masked language modeling) and does not depend on specific downstream tasks, so the compressed models can be used across various downstream tasks. The technical challenge of NAS-BERT is that training a big supernet on the pre-training task is extremely costly. We employ several techniques, including block-wise search, search space pruning, and performance approximation, to improve search efficiency and accuracy. Extensive experiments on the GLUE and SQuAD benchmark datasets demonstrate that NAS-BERT can find lightweight models with better accuracy than previous approaches, and that these models can be directly applied to different downstream tasks with adaptive model sizes for different memory or latency requirements.
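To make the idea of "multiple compressed models with adaptive sizes and latencies" concrete, the following is a minimal sketch of the final selection step only: given a pool of already-evaluated candidate architectures, pick the best-scoring one under each device budget. It is not the NAS-BERT implementation. The `Candidate` class, the `best_under_budget` helper, the architecture names, and all numbers are hypothetical and chosen purely for illustration; the supernet training, block-wise search, search space pruning, and performance approximation mentioned in the abstract are not modeled here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate architecture with illustrative measurements."""
    name: str
    params_m: float     # parameter count in millions (made-up value)
    latency_ms: float   # inference latency in milliseconds (made-up value)
    proxy_score: float  # estimated quality, higher is better (made-up value)


def best_under_budget(
    candidates: Sequence[Candidate],
    max_params_m: float,
    max_latency_ms: float,
) -> Optional[Candidate]:
    """Return the highest-scoring candidate that fits both limits, or None."""
    feasible = [
        c for c in candidates
        if c.params_m <= max_params_m and c.latency_ms <= max_latency_ms
    ]
    return max(feasible, key=lambda c: c.proxy_score, default=None)


# Illustrative pool of candidates; every value here is an assumption.
CANDIDATES = [
    Candidate("arch_a", 5.0, 20.0, 0.70),
    Candidate("arch_b", 10.0, 35.0, 0.76),
    Candidate("arch_c", 30.0, 90.0, 0.80),
    Candidate("arch_d", 60.0, 180.0, 0.83),
    Candidate("arch_e", 28.0, 120.0, 0.81),
]

# One (max parameters in millions, max latency in ms) pair per target device.
BUDGETS = [(10.0, 40.0), (30.0, 100.0), (60.0, 200.0)]

selected = [best_under_budget(CANDIDATES, p, l) for p, l in BUDGETS]

for (p, l), choice in zip(BUDGETS, selected):
    label = choice.name if choice is not None else "no feasible candidate"
    print(f"budget: {p}M params, {l} ms -> {label}")
```

Because a single pool of candidates serves every budget, adding a new device only requires another selection pass, not another search. Note that `arch_e` is skipped for the middle budget even though it scores higher than `arch_c`, because it exceeds the latency limit.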

AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search

NAS-BERT: Task-Agnostic and Adaptive-Size BERT Compression with Neural Architecture Search

Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains

LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression

You Only Compress Once: Towards Effective and Elastic BERT Compression via Exploit-Explore Stochastic Nature Gradient

On-Demand Deep Model Compression for Mobile Devices

LAD: Layer-Wise Adaptive Distillation for BERT Model Compression

Compressing Pre-trained Models of Code into 3 MB

ROSITA: Refined BERT cOmpreSsion with InTegrAted techniques

EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation

Combining Compressions for Multiplicative Size Scaling on Natural Language Tasks

Adaptive Contrastive Knowledge Distillation for BERT Compression

AdaSpring

Exploring Extreme Parameter Compression for Pre-trained Language Models

Application Specific Compression of Deep Learning Models

Prune Once for All: Sparse Pre-Trained Language Models

On the Compression of Language Models for Code: An Empirical Study on CodeBERT

End-to-End Neural Network Compression via $\frac{\ell_1}{\ell_2}$ Regularized Latency Surrogates

1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed

Weight-Inherited Distillation for Task-Agnostic BERT Compression

ADA-Tucker: Compressing Deep Neural Networks via Adaptive Dimension Adjustment Tucker Decomposition