Abstract: We propose an algorithm that identifies and eliminates irrelevant layers of a neural network during the early stages of training. In contrast to weight-level or filter-level pruning, layer pruning reduces the sequential computation of a neural network, which is the part that is harder to parallelize. We employ a structure with residual connections around nonlinear network sections, which lets information keep flowing through the network once a nonlinear section is pruned. Our approach is based on variational inference principles using Gaussian scale mixture priors on the neural network weights, and it allows for substantial cost savings during both training and inference. More specifically, we learn the variational posterior distribution of scalar Bernoulli random variables, each multiplying the layer weight matrix of a nonlinear section, similarly to adaptive layer-wise dropout. To overcome the challenges of concurrent learning and pruning, such as premature pruning and lack of robustness with respect to weight initialization or the size of the starting network, we adopt the "flattening" hyper-prior on the prior parameters. We prove that, as a result of its usage, the solutions of the resulting optimization problem describe deterministic networks in which each parameter of the posterior distribution is either 0 or 1. We formulate a projected SGD algorithm and prove its convergence to such a solution using stochastic approximation results. In particular, we prove conditions under which a layer's weights converge to zero and derive practical pruning conditions from the theoretical results. The proposed algorithm is evaluated on the MNIST, CIFAR-10 and ImageNet datasets with the common LeNet, VGG16 and ResNet architectures. The simulations demonstrate that our method achieves state-of-the-art performance for layer pruning at a lower computational cost than competing methods, owing to the concurrent training and pruning.
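To make the gating mechanism described in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a single residual section of the form y = x + relu(x (xi W)) with a square weight matrix, no bias, and a scalar gate xi drawn from Bernoulli(theta). The function names `project_unit_interval` and `gated_residual_block` are hypothetical, and the clipping step only illustrates the projection used in a projected SGD update on theta.

```python
import numpy as np


def project_unit_interval(theta):
    """Projection step of a projected-SGD update: clip theta into [0, 1]."""
    return np.clip(theta, 0.0, 1.0)


def gated_residual_block(x, W, theta, rng):
    """Residual section whose weight matrix is multiplied by a Bernoulli gate.

    Assumed form (for illustration only): y = x + relu(x @ (xi * W)),
    with xi ~ Bernoulli(theta). If theta is 0 the gate is always closed,
    the nonlinear path contributes nothing, and the block is the identity.
    """
    xi = 1.0 if rng.random() < theta else 0.0  # sample the scalar gate
    nonlinear_path = np.maximum(x @ (xi * W), 0.0)  # ReLU of the gated layer
    return x + nonlinear_path  # the skip connection keeps information flowing


rng = np.random.default_rng(0)
x = np.ones((2, 4))
W = rng.standard_normal((4, 4))

pruned_out = gated_residual_block(x, W, theta=0.0, rng=rng)  # gate closed
active_out = gated_residual_block(x, W, theta=1.0, rng=rng)  # gate open
```

With theta at 0 the section passes its input through unchanged, which is why such a section can be removed from the network without altering its output; with theta at 1 the section behaves as an ordinary deterministic residual layer.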

FreezeOut: Accelerate Training by Progressively Freezing Layers

Fast Deep Learning Training Through Intelligently Freezing Layers

Accelerating Deep Learning Inference via Freezing

LayerOut: Freezing Layers in Deep Neural Networks

Training Acceleration Method Based on Parameter Freezing

Egeria: Efficient DNN Training with Knowledge-Guided Layer Freezing

Accurate and Fast Deep Evolutionary Networks Structured Representation Through Activating and Freezing Dense Networks

SmartFRZ: An Efficient Training Framework using Attention-Based Layer Freezing

Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks

Layer Freezing & Data Sieving: Missing Pieces of a Generic Framework for Sparse Training

Synchronize Only the Immature Parameters: Communication-Efficient Federated Learning By Freezing Parameters Adaptively

Concurrent Training and Layer Pruning of Deep Neural Networks

Learn & Drop: Fast Learning of CNNs Based on Layer Dropping

AFLoRA: Adaptive Freezing of Low Rank Adaptation in Parameter Efficient Fine-Tuning of Large Models

Pushing the Limits of Sparsity: A Bag of Tricks for Extreme Pruning

Weight Freezing: A Regularization Approach for Fully Connected Layers with an Application in EEG Classification

Heterogeneity-Aware Memory Efficient Federated Learning via Progressive Layer Freezing

Freeze the Discriminator: a Simple Baseline for Fine-Tuning GANs

Optimal transfer protocol by incremental layer defrosting

The Unreasonable Ineffectiveness of the Deeper Layers

SemifreddoNets: Partially Frozen Neural Networks for Efficient Computer Vision Systems