Four Things Everyone Should Know to Improve Batch Normalization

Cecilia Summers, Michael J. Dinneen
DOI: https://doi.org/10.48550/arXiv.1906.03548
2020-02-14
Abstract: A key component of most neural network architectures is the use of normalization layers, such as Batch Normalization. Despite its common use and large utility in optimizing deep architectures, it has been challenging both to generically improve upon Batch Normalization and to understand the circumstances that lend themselves to other enhancements. In this paper, we identify four improvements to the generic form of Batch Normalization and the circumstances under which they work, yielding performance gains across all batch sizes while requiring no additional computation during training. These contributions include proposing a method for reasoning about the current example in inference normalization statistics, fixing a training vs. inference discrepancy; recognizing and validating the powerful regularization effect of Ghost Batch Normalization for small and medium batch sizes; examining the effect of weight decay regularization on the scaling and shifting parameters gamma and beta; and identifying a new normalization algorithm for very small batch sizes by combining the strengths of Batch and Group Normalization. We validate our results empirically on six datasets: CIFAR-100, SVHN, Caltech-256, Oxford Flowers-102, CUB-2011, and ImageNet.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper sets out to improve Batch Normalization (BN) so that it performs better across batch sizes without requiring additional computation during training. It identifies four issues and addresses each in turn:

1. **Adjustment of sample weights during the inference process**:
   - During inference, BN uses moving averages of the training statistics, so each sample contributes nothing to its own normalization statistics. During training, every sample contributes to the statistics of its own batch, so the two stages behave differently.
   - The paper proposes adjusting the normalization statistics according to the current sample during inference, which fixes this discrepancy.
2. **Ghost Batch Normalization (GBN) for medium batch sizes**:
   - GBN computes the normalization statistics over smaller sub-batches of each training batch, which strengthens the regularization effect.
   - The paper finds that GBN is useful not only for large-batch training but also for medium batch sizes, especially when the model is prone to overfitting.
3. **Weight decay in Batch Normalization**:
   - Weight decay is a regularization technique commonly used to prevent overfitting. However, the effect of applying it to the scaling parameter \(\gamma\) and the shift parameter \(\beta\) of BN layers has not been fully studied.
   - The paper examines how weight decay affects these parameters and finds that the effect depends on the network architecture and the task.
4. **Generalization of Batch Normalization and Group Normalization (GN) for small batches**:
   - For very small batch sizes, BN performs poorly because reliable normalization statistics cannot be obtained.
   - The paper proposes a new method that combines BN and GN. By extending the grouping mechanism of GN, it draws on information across samples as well as within channels, which improves performance in small-batch training.

Together, these four improvements aim to enhance the performance of BN across batch sizes, especially small and medium ones, and none of them requires additional computation during training.
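
As a rough illustration of the second point, the sketch below shows the core idea of Ghost Batch Normalization: the batch is split into smaller "ghost" batches, and each one is normalized with its own mean and variance. It is a minimal sketch and not the authors' implementation. The function name `ghost_batch_norm` and the parameters `ghost_size` and `eps` are hypothetical, the input is assumed to be a plain list of feature vectors, and the learnable \(\gamma\) and \(\beta\) as well as the running statistics used at inference are left out.

```python
from statistics import fmean, pvariance


def ghost_batch_norm(batch, ghost_size, eps=1e-5):
    """Normalize each feature using per-ghost-batch statistics.

    batch: list of examples, each a list of feature values.
    ghost_size: examples per ghost batch (hypothetical parameter name).
    eps: small constant added to the variance for numerical stability.
    """
    if len(batch) % ghost_size != 0:
        raise ValueError("batch size must be divisible by ghost_size")

    normalized = []
    for start in range(0, len(batch), ghost_size):
        ghost = batch[start:start + ghost_size]
        # Transpose so that each tuple holds one feature across the ghost batch.
        columns = list(zip(*ghost))
        means = [fmean(col) for col in columns]
        # Population variance, as used for batch statistics.
        variances = [pvariance(col, mu) for col, mu in zip(columns, means)]
        for example in ghost:
            normalized.append([
                (x - mu) / (var + eps) ** 0.5
                for x, mu, var in zip(example, means, variances)
            ])
    return normalized


# A batch of four one-feature examples, split into two ghost batches of two.
batch = [[1.0], [3.0], [10.0], [30.0]]
normalized = ghost_batch_norm(batch, ghost_size=2)

# Setting ghost_size to the full batch size gives ordinary batch statistics.
full_batch = ghost_batch_norm(batch, ghost_size=4)
```

Because each ghost batch sees only part of the full batch, its statistics are noisier than full-batch statistics, which is the source of the extra regularization described above.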