Layer Normalization

Jimmy Lei Ba,Jamie Ryan Kiros,Geoffrey E. Hinton
DOI: https://doi.org/10.48550/arXiv.1607.06450
2016-07-22
Abstract:Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
Machine Learning
What problem does this paper attempt to address?
This paper attempts to address the problem of how to effectively reduce training time in deep neural network training, particularly in the application of Recurrent Neural Networks (RNN). Specifically, the paper proposes a new normalization technique—Layer Normalization—to overcome the limitations of Batch Normalization in RNNs. ### Background - **Batch Normalization** significantly reduces the training time of feedforward neural networks by normalizing the input of neurons through calculating the mean and variance for each training batch. However, the effectiveness of Batch Normalization depends on the batch size and faces difficulties when applied to RNNs because the input length of RNNs varies, requiring different statistics for different time steps. - In **online learning tasks** or **large-scale distributed models**, Batch Normalization is challenging to apply due to the limitation of batch size. ### Layer Normalization - **Definition**: Layer Normalization normalizes by calculating the mean and variance of the sum of inputs to all neurons in a layer for a single training case, rather than relying on the entire batch of data as Batch Normalization does. - **Advantages**: - **Independent of batch size**: Layer Normalization does not depend on batch size, making it applicable to online learning tasks or large-scale distributed models. - **Suitable for RNNs**: Layer Normalization can be directly applied to RNNs by calculating normalization statistics at each time step, thereby stabilizing the hidden state dynamics of RNNs. - **Consistent training and testing**: Layer Normalization performs the same calculations during training and testing, avoiding the inconsistency issues of Batch Normalization between training and testing. ### Experimental Results - **Image-Sentence Ranking**: On the MSCOCO dataset, Layer Normalization significantly improved the model's convergence speed and final performance. - **Machine Reading Comprehension**: Experiments on the CNN corpus showed that Layer Normalization not only trained faster but also outperformed baseline models and Batch Normalization variants on the validation set. - **Skip-thought Vectors**: On the BookCorpus dataset, Layer Normalization accelerated the training process and achieved better performance on multiple downstream tasks. ### Conclusion Layer Normalization is an effective normalization technique that can improve training speed and generalization performance in various neural network models, especially excelling in RNNs. Compared to Batch Normalization, Layer Normalization is not limited by batch size and is suitable for online learning and large-scale distributed models.