Batch Normalization and the impact of batch structure on the behavior of deep convolution networks

Mohamed Hajaj,Duncan Gillies
DOI: https://doi.org/10.48550/arXiv.1802.07590
2018-02-21
Abstract:Batch normalization was introduced in 2015 to speed up training of deep convolution networks by normalizing the activations across the current batch to have zero mean and unity variance. The results presented here show an interesting aspect of batch normalization, where controlling the shape of the training batches can influence what the network will learn. If training batches are structured as balanced batches (one image per class), and inference is also carried out on balanced test batches, using the batch's own means and variances, then the conditional results will improve considerably. The network uses the strong information about easy images in a balanced batch, and propagates it through the shared means and variances to help decide the identity of harder images on the same batch. Balancing the test batches requires the labels of the test images, which are not available in practice, however further investigation can be done using batch structures that are less strict and might not require the test image labels. The conditional results show the error rate almost reduced to zero for nontrivial datasets with small number of classes such as the CIFAR10.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem that this paper attempts to solve is: **How to influence the learning behavior of deep convolutional neural networks (DCN) by controlling the structure of training batches, especially when using batch normalization (BN)**. ### Specific description of the problem 1. **The role of batch normalization**: Batch normalization was introduced in 2015. It aims to accelerate the training of deep convolutional networks by normalizing the activation values in the current batch to have zero mean and unit variance. This not only speeds up the training but also improves the performance of the final model. 2. **The influence of batch structure**: The paper points out that the implementation of batch normalization makes the activation values of all images in a batch share the mean and variance for normalization processing. Therefore, the output activation value of an image will be affected by other images in the same batch. This means that the behavior of the network depends not only on a single image but also on the structure of the training batch. 3. **The effect of balanced batches**: The paper explores the influence when using **balanced batches** (batches that contain only one image of each category) for training and inference. Specifically: - If both training and inference use balanced batches, and the mean and variance of the current test batch are used during inference, the performance of the network will be significantly improved. - The use of balanced batches enables the network to utilize the information of easily - classified images and pass it to more difficult - to - classify images through the shared mean and variance, thereby improving the overall classification performance. 4. **Challenges in practical applications**: Although using balanced batches has achieved very good results in experiments, in practical applications, constructing a balanced test batch requires knowing the labels of test images, which is not feasible. Therefore, the paper also explores the possibility of using batch structure to improve network performance without relying on test labels. ### Formula representation The formulas for batch normalization are as follows: - Mean calculation: \[ \mu=\frac{1}{m'}\sum_{i = 1}^{m'}x_i \] where \(m'=m\times h\times w\), \(m\) is the batch size, and \(h\times w\) is the size of the feature map. - Variance calculation: \[ \sigma^{2}=\frac{1}{m'}\sum_{i = 1}^{m'}(x_i-\mu)^{2} \] - Normalization operation: \[ \hat{x}_i=\frac{x_i-\mu}{\sqrt{\sigma^{2}+\epsilon}} \] where \(\epsilon\) is a very small constant used to prevent division - by - zero errors. - Linear transformation: \[ y = \alpha\hat{x}_i+\beta \] where \(\alpha\) and \(\beta\) are trainable parameters. ### Summary The core problem of the paper is to explore that the role of batch normalization in deep convolutional networks is not only limited to accelerating training, but also can guide the network to learn additional logic by controlling the structure of training batches. In particular, using balanced batches can significantly improve the classification performance of the network, but in practical applications, there is a challenge of how to utilize this advantage without relying on test labels.