Abstract: A key component of most neural network architectures is the use of normalization layers, such as Batch Normalization. Despite its common use and large utility in optimizing deep architectures, it has been challenging both to generically improve upon Batch Normalization and to understand the circumstances that lend themselves to other enhancements. In this paper, we identify four improvements to the generic form of Batch Normalization and the circumstances under which they work, yielding performance gains across all batch sizes while requiring no additional computation during training. These contributions include proposing a method for reasoning about the current example in inference normalization statistics, fixing a training vs. inference discrepancy; recognizing and validating the powerful regularization effect of Ghost Batch Normalization for small and medium batch sizes; examining the effect of weight decay regularization on the scaling and shifting parameters gamma and beta; and identifying a new normalization algorithm for very small batch sizes by combining the strengths of Batch and Group Normalization. We validate our results empirically on six datasets: CIFAR-100, SVHN, Caltech-256, Oxford Flowers-102, CUB-2011, and ImageNet.

What problem does this paper attempt to address?

The paper addresses how to improve Batch Normalization (BN) so that it performs better across batch sizes without adding computational cost during training. Specifically, it identifies and addresses the following four issues:

1. **Adjustment of sample weights during inference**:
   - During inference, BN uses moving averages of the training statistics, so a sample makes no contribution to its own normalization statistics. In training, by contrast, each sample is part of the batch that produces the statistics, which creates a discrepancy between training and inference.
   - The paper proposes adjusting the normalization statistics according to the current sample during inference to fix this discrepancy.
2. **Ghost Batch Normalization (GBN) for small and medium batch sizes**:
   - GBN computes the normalization statistics on smaller sub-batches of each training batch, which strengthens the regularization effect.
   - The paper finds that GBN is suitable not only for large-batch training but also has a significant effect at small and medium batch sizes, especially when the model is prone to overfitting.
3. **Weight decay in Batch Normalization**:
   - Weight decay is a regularization technique usually used to prevent overfitting. However, its effect on the scaling parameter \(\gamma\) and the shift parameter \(\beta\) of the BN layer has not been fully studied.
   - The paper explores the influence of weight decay on these parameters and finds that its effect depends on the network architecture and the task.
4. **Generalization of Batch Normalization and Group Normalization (GN) for small batches**:
   - For very small batch sizes, BN performs poorly because reliable normalization statistics cannot be obtained.
   - The paper proposes a new method that combines BN and GN. By extending the grouping mechanism of GN, it computes statistics not only over groups of channels within a sample but also across the samples in the batch, thereby improving the performance of small-batch training.

Through these four improvements, the paper aims to enhance the performance of BN across batch sizes, especially small and medium ones, and none of the improvements requires additional computation during training.
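The sub-batch statistics of GBN and the sample-plus-channel-group statistics of the combined BN/GN method can be illustrated with a minimal NumPy sketch. This is an illustration under stated assumptions, not the paper's implementation: the function names `ghost_batch_norm` and `batch_group_norm`, the parameters `ghost_size`, `num_groups`, and `eps`, and the choice to pool the combined statistics over all samples in the batch are hypothetical. The learnable \(\gamma\) and \(\beta\) and the moving averages used at inference are omitted.

```python
import numpy as np


def ghost_batch_norm(x, ghost_size, eps=1e-5):
    """Training-mode normalization of x with shape (N, C).

    Statistics are computed separately on each consecutive sub-batch
    ("ghost batch") of ghost_size samples rather than on the full batch.
    """
    n, c = x.shape
    if n % ghost_size != 0:
        raise ValueError("batch size must be divisible by ghost_size")
    # View the batch as (num_ghost_batches, ghost_size, C).
    ghosts = x.reshape(n // ghost_size, ghost_size, c)
    mean = ghosts.mean(axis=1, keepdims=True)  # per ghost batch, per channel
    var = ghosts.var(axis=1, keepdims=True)
    return ((ghosts - mean) / np.sqrt(var + eps)).reshape(n, c)


def batch_group_norm(x, num_groups, eps=1e-5):
    """Normalization of x with shape (N, C, H, W).

    Assumption for this sketch: one mean and variance per channel group,
    pooled over all samples, the channels in the group, and the spatial
    positions (Group Normalization would instead keep samples separate).
    """
    n, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    axes = (0, 2, 3, 4)  # samples, channels within group, height, width
    mean = g.mean(axis=axes, keepdims=True)
    var = g.var(axis=axes, keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)


# Example: a batch of 8 samples split into two ghost batches of 4.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))
y = ghost_batch_norm(x, ghost_size=4)
```

With `ghost_size` equal to the full batch size, `ghost_batch_norm` reduces to ordinary training-mode BN on a 2-D input; smaller values make each sample's statistics noisier, which is the source of the regularization effect described above.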

Four Things Everyone Should Know to Improve Batch Normalization

Batch Normalization and the impact of batch structure on the behavior of deep convolution networks

Batchless Normalization: How to Normalize Activations Across Instances with Minimal Memory Requirements

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Batch Group Normalization

Batch Kalman Normalization: Towards Training Deep Neural Networks with Micro-Batches

Towards Stabilizing Batch Statistics in Backward Propagation of Batch Normalization

Iterative Normalization: Beyond Standardization towards Efficient Whitening

Cross-Iteration Batch Normalization

Exponential convergence rates for Batch Normalization: The power of length-direction decoupling in non-convex optimization

An Empirical Analysis of the Shift and Scale Parameters in BatchNorm

Effective and Efficient Batch Normalization Using a Few Uncorrelated Data for Statistics Estimation

Research Progress on Batch Normalization of Deep Learning and Its Related Algorithms

Why Batch Normalization Works? A Buckling Perspective

Beyond BatchNorm: Towards a Unified Understanding of Normalization in Deep Learning

Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks

Understanding and Improving Group Normalization

Unified Batch Normalization: Identifying and Alleviating the Feature Condensation in Batch Normalization and a Unified Framework

Extended Batch Normalization

Generalized Batch Normalization: Towards Accelerating Deep Neural Networks

New Interpretations of Normalization Methods in Deep Learning