Abstract: With the push to democratize deep learning, there is increasing demand to deploy Transformer-based natural language processing (NLP) models on resource-constrained devices with low latency and high accuracy. Existing BERT pruning methods require domain experts to heuristically handcraft hyperparameters to strike a balance among model size, latency, and accuracy. In this work, we propose AE-BERT, an automatic and efficient BERT pruning framework with efficient evaluation to select a "good" sub-network candidate (with high accuracy) given the overall pruning ratio constraints. Our proposed method requires no human expert experience and achieves better accuracy on many NLP tasks. Our experimental results on the General Language Understanding Evaluation (GLUE) benchmark show that AE-BERT outperforms the state-of-the-art (SOTA) hand-crafted pruning methods on BERT$_{\mathrm{BASE}}$. On QNLI and RTE, we obtain 75\% and 42.8\% higher overall pruning ratios, respectively, while achieving higher accuracy. On MRPC, we obtain a score 4.6 higher than the SOTA at the same overall pruning ratio of 0.5. On STS-B, we achieve a 40\% higher pruning ratio with a very small loss in Spearman correlation compared to SOTA hand-crafted pruning methods. Experimental results also show that, after model compression, inference of a single BERT$_{\mathrm{BASE}}$ encoder on a Xilinx Alveo U200 FPGA board achieves a 1.83$\times$ speedup over an Intel(R) Xeon(R) Gold 5218 (2.30GHz) CPU, which shows that it is reasonable to deploy the BERT$_{\mathrm{BASE}}$ subnets generated by the proposed method on computation-restricted devices.

Accelerating BERT inference with GPU-efficient exit prediction

SmartBERT: A Promotion of Dynamic Early Exiting Mechanism for Accelerating BERT Inference

Exploiting Student Parallelism for Low-latency GPU Inference of BERT-like Models in Online Services

SkipBERT: Efficient Inference with Shallow Layer Skipping

EarlyBERT: Efficient BERT Training Via Early-bird Lottery Tickets

DE$^3$-BERT: Distance-Enhanced Early Exiting for BERT based on Prototypical Networks

Fast and Accurate FSA System Using ELBERT: An Efficient and Lightweight BERT

Exponentially Faster Language Modelling

SlowBERT: Slow-down Attacks on Input-adaptive Multi-exit BERT

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Breaking MLPerf Training: A Case Study on Optimizing BERT

DPBERT: Efficient Inference for BERT based on Dynamic Planning

Accelerating Large Batch Training via Gradient Signal to Noise Ratio (GSNR)

G-Bert: Enabling Green BERT Deployment on FPGA Via Hardware-Aware Hybrid Pruning

Accelerating BERT Inference for Sequence Labeling Via Early-Exit

An Automatic and Efficient BERT Pruning for Edge AI Systems

A new computationally efficient method to tune BERT networks – transfer learning

Efficient Training of BERT by Progressively Stacking

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

HuBERT-EE: Early Exiting HuBERT for Efficient Speech Recognition

ExtremeBERT: A Toolkit for Accelerating Pretraining of Customized BERT