Abstract: We propose an algorithm that identifies and eliminates irrelevant layers of a neural network during the early stages of training. In contrast to weight- or filter-level pruning, layer pruning reduces the sequential computation of a neural network, which is harder to parallelize. We employ a structure with residual connections around nonlinear network sections, which allows information to keep flowing through the network once a nonlinear section is pruned. Our approach is based on variational inference principles using Gaussian scale mixture priors on the neural network weights and allows for substantial cost savings during both training and inference. More specifically, we learn the variational posterior distribution of scalar Bernoulli random variables, each multiplying a layer weight matrix of a nonlinear section, similarly to adaptive layer-wise dropout. To overcome challenges of concurrent learning and pruning, such as premature pruning and lack of robustness with respect to weight initialization or the size of the starting network, we adopt the "flattening" hyper-prior on the prior parameters. We prove that, as a result of its use, the solutions of the resulting optimization problem describe deterministic networks with parameters of the posterior distribution at either 0 or 1. We formulate a projected SGD algorithm and prove its convergence to such a solution using stochastic approximation results. In particular, we prove conditions that lead to a layer's weights converging to zero and derive practical pruning conditions from the theoretical results. The proposed algorithm is evaluated on the MNIST, CIFAR-10 and ImageNet datasets and the common LeNet, VGG16 and ResNet architectures. The simulations demonstrate that our method achieves state-of-the-art performance for layer pruning at a reduced computational cost compared with competing methods, owing to the concurrent training and pruning.
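
The gating structure described in the abstract can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation. The function names, the bias-free ReLU section, and the threshold rule `theta <= tol` are all hypothetical. The abstract does not state the actual pruning conditions or the hyper-prior gradient, so neither is reproduced here. The sketch shows only that a residual section whose Bernoulli gate is 0 reduces to the identity, and that a projection step keeps the gate parameter in [0, 1].

```python
import numpy as np

def gated_residual_block(x, W, theta, rng=None):
    """Compute y = x + relu(x @ (xi * W)) with a scalar gate xi.

    xi is sampled from Bernoulli(theta) when rng is given; otherwise the
    gate parameter theta itself is used (assumed deterministic evaluation).
    """
    xi = theta if rng is None else float(rng.random() < theta)
    return x + np.maximum(x @ (xi * W), 0.0)

def project_unit_interval(theta):
    """Projection step of a projected SGD update: clip theta to [0, 1]."""
    return float(np.clip(theta, 0.0, 1.0))

def prune_sections(sections, tol=1e-3):
    """Drop sections whose gate parameter is at most tol (hypothetical rule)."""
    return [(W, theta) for W, theta in sections if theta > tol]

x = np.ones((2, 3))
W = np.eye(3)
y_pruned = gated_residual_block(x, W, theta=0.0)  # gate closed: identity map
y_kept = gated_residual_block(x, W, theta=1.0)    # gate open: x + relu(x @ W)
```

With `theta = 0.0` the nonlinear section contributes nothing and the block passes `x` through the residual connection unchanged, so the section can be removed without altering the network's output.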

Efficient Finite Initialization for Tensorized Neural Networks

How to Initialize your Network? Robust Initialization for WeightNorm & ResNets

Critical Initialization of Wide and Deep Neural Networks through Partial Jacobians: General Theory and Applications

IterNorm: Fast Iterative Normalization

Evolving Normalization-Activation Layers

Identical Initialization: A Universal Approach to Fast and Stable Training of Neural Networks

Unified Normalization for Accelerating and Stabilizing Transformers

Geometry-aware training of factorized layers in tensor Tucker format

A mathematical framework for improved weight initialization of neural networks using Lagrange multipliers

Re-Introducing LayerNorm: Geometric Meaning, Irreversibility and a Comparative Study with RMSNorm

Concurrent Training and Layer Pruning of Deep Neural Networks

Transformer Normalisation Layers and the Independence of Semantic Subspaces

SLaNC: Static LayerNorm Calibration

T-Net: Parametrizing Fully Convolutional Nets with a Single High-Order Tensor

Towards Understanding the Condensation of Neural Networks at Initial Training

Initialization Seeds Facilitating Neural Network Quantization

Automatic Optimisation of Normalised Neural Networks

Layer Normalization

Efficient anti-symmetrization of a neural network layer by taming the sign problem

Neural Functional Transformers

MimicNorm: Weight Mean and Last BN Layer Mimic the Dynamic of Batch Normalization