Abstract: We propose an algorithm capable of identifying and eliminating irrelevant layers of a neural network during the early stages of training. In contrast to weight or filter-level pruning, layer pruning reduces the harder-to-parallelize sequential computation of a neural network. We employ a structure using residual connections around nonlinear network sections, which allows information to flow through the network once a nonlinear section is pruned. Our approach is based on variational inference principles using Gaussian scale mixture priors on the neural network weights and allows for substantial cost savings during both training and inference. More specifically, we learn the variational posterior distribution of scalar Bernoulli random variables, each multiplying a layer weight matrix of a nonlinear section, similarly to adaptive layer-wise dropout. To overcome challenges of concurrent learning and pruning, such as premature pruning and lack of robustness with respect to weight initialization or the size of the starting network, we adopt the "flattening" hyper-prior on the prior parameters. We prove that, as a result of its usage, the solutions of the resulting optimization problem describe deterministic networks with parameters of the posterior distribution at either 0 or 1. We formulate a projected SGD algorithm and prove its convergence to such a solution using stochastic approximation results. In particular, we prove conditions that lead to a layer's weights converging to zero and derive practical pruning conditions from the theoretical results. The proposed algorithm is evaluated on the MNIST, CIFAR-10 and ImageNet datasets and common LeNet, VGG16 and ResNet architectures. The simulations demonstrate that our method achieves state-of-the-art performance for layer pruning at reduced computational cost compared to competing methods, due to the concurrent training and pruning.
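The gated residual structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy tanh section, and the choice to use the posterior parameter directly as the gate at inference are all assumptions made for illustration. It shows only that a block whose Bernoulli parameter sits at 0 reduces to the skip connection, and how a parameter is projected back onto [0, 1] after an update.

```python
import math
import random


def project_unit_interval(theta):
    """Clip a Bernoulli parameter onto [0, 1] (the projection step of projected SGD)."""
    return min(1.0, max(0.0, theta))


def nonlinear_section(x, weight):
    """Hypothetical stand-in for a nonlinear section: tanh of a scalar-weighted input."""
    return [math.tanh(weight * v) for v in x]


def gated_residual_block(x, weight, theta, rng, training=True):
    """Return x + xi * f(x), where xi ~ Bernoulli(theta) during training.

    Assumption: at inference the gate is set to theta itself, which is
    exact when theta has converged to 0 or 1.
    """
    if training:
        xi = 1.0 if rng.random() < theta else 0.0
    else:
        xi = theta
    if xi == 0.0:
        # Pruned (or dropped) section: only the residual path remains,
        # so the nonlinear section need not be evaluated at all.
        return list(x)
    fx = nonlinear_section(x, weight)
    return [v + xi * f for v, f in zip(x, fx)]


rng = random.Random(0)
x = [0.5, -1.0]
pruned = gated_residual_block(x, 2.0, theta=0.0, rng=rng)  # identity mapping
kept = gated_residual_block(x, 2.0, theta=1.0, rng=rng)    # x + tanh(2 * x)
```

With the parameter at 0 the block is the identity and can be removed from the network; with the parameter at 1 it behaves as an ordinary deterministic residual block.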

LayerOut: Freezing Layers in Deep Neural Networks

FreezeOut: Accelerate Training by Progressively Freezing Layers

Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks

SparseConnect: Regularising CNNs on Fully Connected Layers

Shakeout: A New Approach to Regularized Deep Neural Network Training

Concurrent Training and Layer Pruning of Deep Neural Networks

Effective and Efficient Dropout for Deep Convolutional Neural Networks

Dropout Reduces Underfitting

ChannelDropBack: Forward-Consistent Stochastic Regularization for Deep Networks

Fast Deep Learning Training Through Intelligently Freezing Layers

Regularizing Deep Networks Using Efficient Layerwise Adversarial Training

Learn & drop: fast learning of cnns based on layer dropping

Overfitting Remedy by Sparsifying Regularization on Fully-Connected Layers of CNNs

LocalDrop: A Hybrid Regularization for Deep Neural Networks

Functional Network: A Novel Framework for Interpretability of Deep Neural Networks

An Adaptive and Stability-Promoting Layerwise Training Approach for Sparse Deep Neural Network Architecture

Layer Normalization

Layer-Stack Temperature Scaling

Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon

AutoDropout: Learning Dropout Patterns to Regularize Deep Networks

PLACE dropout: A Progressive Layer-wise and Channel-wise Dropout for Domain Generalization