Abstract: With the push to democratize deep learning, there is growing demand to deploy Transformer-based natural language processing (NLP) models on resource-constrained devices with low latency and high accuracy. Existing BERT pruning methods require domain experts to heuristically handcraft hyperparameters to strike a balance among model size, latency, and accuracy. In this work, we propose AE-BERT, an automatic and efficient BERT pruning framework with efficient evaluation to select a "good" sub-network candidate (one with high accuracy) given an overall pruning ratio constraint. The proposed method requires no human expert experience and achieves better accuracy on many NLP tasks. Our experimental results on the General Language Understanding Evaluation (GLUE) benchmark show that AE-BERT outperforms state-of-the-art (SOTA) hand-crafted pruning methods on BERT$_{\mathrm{BASE}}$. On QNLI and RTE, we obtain 75\% and 42.8\% higher overall pruning ratios, respectively, while achieving higher accuracy. On MRPC, we obtain a score 4.6 higher than the SOTA at the same overall pruning ratio of 0.5. On STS-B, we achieve a 40\% higher pruning ratio with a very small loss in Spearman correlation compared to SOTA hand-crafted pruning methods. Experimental results also show that, after model compression, inference of a single BERT$_{\mathrm{BASE}}$ encoder on a Xilinx Alveo U200 FPGA board achieves a 1.83$\times$ speedup over an Intel(R) Xeon(R) Gold 5218 (2.30GHz) CPU, which supports the feasibility of deploying the BERT$_{\mathrm{BASE}}$ subnets generated by the proposed method on computation-restricted devices.
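The abstract does not spell out the search procedure, so the following is only an illustrative sketch of what selecting a sub-network candidate under an overall pruning ratio constraint could look like. It assumes that a candidate is a list of per-layer pruning ratios, that the overall ratio is the size-weighted mean of those ratios, and that candidates are ranked by a cheap scoring function. The names `sample_candidate`, `select_candidate`, `toy_score`, and the `sensitivity` values are hypothetical and do not come from AE-BERT.

```python
import random


def overall_pruning_ratio(layer_sizes, layer_ratios):
    """Fraction of all weights removed when layer i loses layer_ratios[i] of its weights."""
    pruned = sum(n * r for n, r in zip(layer_sizes, layer_ratios))
    return pruned / sum(layer_sizes)


def sample_candidate(layer_sizes, target, rng, spread=0.2):
    """Draw per-layer ratios near `target`, then shift them to hit `target` overall.

    Assumption: 2 * spread <= min(target, 1 - target), so every shifted ratio
    stays inside [0, 1] without clipping.
    """
    if 2 * spread > min(target, 1.0 - target):
        raise ValueError("spread is too large for this target ratio")
    ratios = [target + rng.uniform(-spread, spread) for _ in layer_sizes]
    # The overall ratio is a weighted mean, so adding a constant to every
    # per-layer ratio moves the overall ratio by that same constant.
    shift = target - overall_pruning_ratio(layer_sizes, ratios)
    return [r + shift for r in ratios]


def select_candidate(layer_sizes, target, score_fn, num_candidates=20, seed=0):
    """Sample candidates that satisfy the constraint and keep the best-scoring one."""
    rng = random.Random(seed)
    candidates = [sample_candidate(layer_sizes, target, rng) for _ in range(num_candidates)]
    return max(candidates, key=score_fn)


# Toy setup: three layers and a made-up score that penalizes pruning
# "sensitive" layers. A real system would use an accuracy estimate instead.
layer_sizes = [1000, 2000, 3000]
sensitivity = [3.0, 1.0, 2.0]  # hypothetical values, for illustration only


def toy_score(ratios):
    return -sum(s * r for s, r in zip(sensitivity, ratios))


best = select_candidate(layer_sizes, target=0.5, score_fn=toy_score)
print([round(r, 3) for r in best], round(overall_pruning_ratio(layer_sizes, best), 3))
```

Every sampled candidate removes the same total fraction of weights, so the score only has to distinguish how that budget is distributed across layers.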

On Importance of Layer Pruning for Smaller BERT Models and Low Resource Languages

Towards Building Efficient Sentence BERT Models using Layer Pruning

How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark

An Automatic and Efficient BERT Pruning for Edge AI Systems

On Importance of Pruning and Distillation for Efficient Low Resource NLP

The Unreasonable Ineffectiveness of the Deeper Layers

Structural Pruning of Pre-trained Language Models via Neural Architecture Search

Reassessing Layer Pruning in LLMs: New Insights and Methods

Structured Pruning of a BERT-based Question Answering Model

Pruning Foundation Models for High Accuracy without Retraining

Prune Once for All: Sparse Pre-Trained Language Models

FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models

Pruning before Fine-tuning: A Retraining-free Compression Framework for Pre-trained Language Models

SlimGPT: Layer-wise Structured Pruning for Large Language Models

Less is more: Pruning BERTweet architecture in Twitter sentiment analysis

BlockPruner: Fine-grained Pruning for Large Language Models

Adapting by Pruning: A Case Study on BERT

Structured Pruning of Large Language Models

AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models

Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods

DDK: Dynamic structure pruning based on differentiable search and recursive knowledge distillation for BERT