Abstract: Network pruning is widely used for reducing the heavy inference cost of deep models in low-resource settings. A typical pruning algorithm is a three-stage pipeline, i.e., training (a large model), pruning, and fine-tuning. During pruning, according to a certain criterion, redundant weights are pruned and important weights are kept to best preserve the accuracy. In this work, we make several surprising observations which contradict common beliefs. For all state-of-the-art structured pruning algorithms we examined, fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights. For pruning algorithms which assume a predefined target network architecture, one can get rid of the full pipeline and directly train the target network from scratch. Our observations are consistent across multiple network architectures, datasets, and tasks, and imply that: 1) training a large, over-parameterized model is often not necessary to obtain an efficient final model, 2) learned "important" weights of the large model are typically not useful for the small pruned model, and 3) the pruned architecture itself, rather than a set of inherited "important" weights, is more crucial to the efficiency of the final model, which suggests that in some cases pruning can be useful as an architecture search paradigm. Our results suggest the need for more careful baseline evaluations in future research on structured pruning methods. We also compare with the "Lottery Ticket Hypothesis" (Frankle & Carbin, 2019), and find that with an optimal learning rate, the "winning ticket" initialization as used in Frankle & Carbin (2019) does not bring improvement over random initialization.
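The two starting points compared in the abstract (fine-tuning inherited weights versus training the same pruned architecture from random initialization) can be sketched for a single layer as follows. This is a minimal illustration, not the authors' code: the L1-norm ranking criterion, the He-style re-initialization, the flat-list filter layout, and every function name here are assumptions chosen for the example, and no training loop is shown.

```python
import math
import random


def select_filters_l1(filters, keep_ratio):
    """Return sorted indices of the filters with the largest L1 norms.

    `filters` is a list of flattened filters (one list of floats per output
    channel). The L1 criterion is an assumption made for illustration.
    """
    n_keep = max(1, round(len(filters) * keep_ratio))
    norms = [sum(abs(x) for x in f) for f in filters]
    ranked = sorted(range(len(filters)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:n_keep])


def inherit_weights(filters, keep):
    """Starting point for fine-tuning: copy the surviving filters."""
    return [list(filters[i]) for i in keep]


def reinit_weights(filters, keep, rng):
    """Starting point for training from scratch: same shape as the pruned
    layer, but fresh random values (He-style scale, assumed here)."""
    fan_in = len(filters[0])
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in keep]


rng = random.Random(0)
# Hypothetical "trained" layer: 8 filters, each 3 x 3 x 3 = 27 weights.
layer = [[rng.gauss(0.0, 1.0) for _ in range(27)] for _ in range(8)]

keep = select_filters_l1(layer, keep_ratio=0.5)
finetune_init = inherit_weights(layer, keep)      # pipeline: train, prune, fine-tune
scratch_init = reinit_weights(layer, keep, rng)   # baseline: pruned architecture only
```

Both initializations describe the same pruned architecture (same number of filters, same filter size); they differ only in whether the values come from the large model. The abstract's claim is that training from the second starting point matches or beats fine-tuning from the first.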

EBERT: Efficient BERT Inference with Dynamic Structured Pruning

DDK: Dynamic structure pruning based on differentiable search and recursive knowledge distillation for BERT

Structured Pruning for Efficient Convolutional Neural Networks Via Incremental Regularization

Class-Aware Pruning for Efficient Neural Networks

Structured Probabilistic Pruning for Convolutional Neural Network Acceleration

An Automatic and Efficient BERT Pruning for Edge AI Systems

SmartBERT: A Promotion of Dynamic Early Exiting Mechanism for Accelerating BERT Inference

Adaptive Activation-based Structured Pruning

Towards Building Efficient Sentence BERT Models using Layer Pruning

Structured Pruning of a BERT-based Question Answering Model

Structured Pruning of Large Language Models

Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning

Reconstruct the Pruned Model without Any Retraining

Structured Term Pruning for Computational Efficient Neural Networks Inference

G-Bert: Enabling Green BERT Deployment on FPGA Via Hardware-Aware Hybrid Pruning

Rethinking the Value of Network Pruning

Cloud–Edge Collaborative Inference with Network Pruning

LEAP: Learnable Pruning for Transformer-based Models

When to Prune? A Policy towards Early Structural Pruning

Structured Pruning Learns Compact and Accurate Models

Towards Structured Dynamic Sparse Pre-Training of BERT