Why Batch Normalization Works? A Buckling Perspective

Li Chen, Hongxiao Fei, Yanru Xiao, Jiabao He, Haifeng Li
DOI: https://doi.org/10.1109/icinfa.2017.8079081
2017-01-01
Abstract: In deep neural networks, the inputs to each layer depend on the parameters of all preceding layers, so even small changes in the distribution of the network's inputs propagate to the internal layers. The resulting mismatch between the source domain and the target domain is known as covariate shift. Batch Normalization (BN) was designed to address this issue by normalizing each training mini-batch. However, the mechanism by which BN handles covariate shift has not been explained in detail. This paper describes the usefulness of BN from the perspective of physical buckling. From this viewpoint, BN can be treated as two kinds of constraints, fixed and pinned, which affect both the data distribution and the training of the network. We compare a simple Convolutional Neural Network (CNN) with BN against one without BN on the MNIST dataset and record how the distributions change during training. We also present the distances between different layers during training, computed with the Earth mover's distance, to explain the effectiveness and advantages of BN in handling covariate shift.
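The two operations the abstract relies on, per-mini-batch normalization and the Earth mover's distance between activation distributions, can be illustrated with a minimal sketch. This is not the paper's experiment, which uses a CNN on MNIST. It works on a single synthetic 1-D feature, and the function names `batch_norm` and `emd_1d`, the affine drift, and the sample size are assumptions chosen only for illustration.

```python
import math
import random


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    # Biased (divide-by-n) variance, as used by BN in training mode.
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


def emd_1d(a, b):
    """Earth mover's distance between two equal-size 1-D samples.

    For equal-size samples with uniform weights, the optimal transport
    plan pairs the sorted values, so the distance is the mean absolute
    difference between the sorted samples.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-size samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Hypothetical activations of one unit, and the same activations after an
# affine drift that stands in for covariate shift.
rng = random.Random(0)
layer_in = [rng.gauss(0.0, 1.0) for _ in range(256)]
shifted = [3.0 * x + 5.0 for x in layer_in]

raw_gap = emd_1d(layer_in, shifted)
bn_gap = emd_1d(batch_norm(layer_in), batch_norm(shifted))
print(f"EMD without BN: {raw_gap:.4f}")
print(f"EMD with BN:    {bn_gap:.6f}")
```

In this toy setting the drift is purely affine, so normalization removes it almost entirely and the distance after BN is close to zero. Real layer activations drift in more complicated ways, which is why the paper measures the distances empirically during training.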