Abstract: Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
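
The two sets of statistics described in the abstract can be written out as a short worked form. The notation is assumed here for illustration and does not come from the text above: $a_i$ is the summed input to unit $i$ in a layer of $H$ units, $g_i$ and $b_i$ are that unit's gain and bias, $f$ is the non-linearity, and $n$ indexes the $m$ training cases in a mini-batch.

```latex
% Layer normalization: statistics over the H units of one layer,
% computed from a single training case.
\mu = \frac{1}{H}\sum_{i=1}^{H} a_i, \qquad
\sigma = \sqrt{\frac{1}{H}\sum_{i=1}^{H}\left(a_i - \mu\right)^2}, \qquad
h_i = f\!\left(\frac{g_i}{\sigma}\,\left(a_i - \mu\right) + b_i\right)

% Batch normalization, for contrast: statistics per unit i,
% computed over the m cases of a mini-batch.
\mu_i = \frac{1}{m}\sum_{n=1}^{m} a_i^{(n)}, \qquad
\sigma_i = \sqrt{\frac{1}{m}\sum_{n=1}^{m}\left(a_i^{(n)} - \mu_i\right)^2}
```

In the layer-normalization case every unit in the layer shares the same $\mu$ and $\sigma$, while each training case gets its own pair. This is why the computation does not depend on the mini-batch size.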

What problem does this paper attempt to address?

This paper addresses how to reduce training time for deep neural networks, particularly recurrent neural networks (RNNs). It proposes a new normalization technique, Layer Normalization, to overcome the limitations of Batch Normalization in RNNs.

### Background

- **Batch Normalization** significantly reduces the training time of feed-forward neural networks by normalizing each neuron's summed input with a mean and variance computed over a mini-batch of training cases. However, its effect depends on the mini-batch size, and it is hard to apply to RNNs: input sequences vary in length, so different statistics are needed for different time steps.
- In **online learning tasks** or **large-scale distributed models**, Batch Normalization is difficult to apply because the mini-batch size is limited.

### Layer Normalization

- **Definition**: Layer Normalization computes the mean and variance from the summed inputs to all neurons in a layer on a single training case, rather than from a mini-batch of cases as Batch Normalization does.
- **Advantages**:
  - **Independent of batch size**: Layer Normalization does not depend on the mini-batch size, so it can be used in online learning tasks and large-scale distributed models.
  - **Suitable for RNNs**: Layer Normalization applies directly to RNNs by computing the normalization statistics separately at each time step, which stabilizes the hidden state dynamics.
  - **Consistent training and testing**: Layer Normalization performs exactly the same computation at training and test time, avoiding the train/test mismatch of Batch Normalization.

### Experimental Results

- **Image-Sentence Ranking**: On the MSCOCO dataset, Layer Normalization significantly improved the model's convergence speed and final performance.
- **Machine Reading Comprehension**: Experiments on the CNN corpus showed that Layer Normalization not only trained faster but also outperformed the baseline models and the Batch Normalization variants on the validation set.
- **Skip-thought Vectors**: On the BookCorpus dataset, Layer Normalization accelerated training and achieved better performance on multiple downstream tasks.

### Conclusion

Layer Normalization is an effective normalization technique that can improve training speed and generalization performance across a range of neural network models, and it works especially well in RNNs. Unlike Batch Normalization, it is not limited by the mini-batch size, which makes it suitable for online learning and large-scale distributed models.
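
The contrast summarized above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the `eps` stabiliser, the `tanh` non-linearity, and all weights and inputs are hypothetical choices made for the example.

```python
import math


def layer_norm(a, gain, bias, eps=1e-5):
    """Layer norm for ONE training case: statistics are taken over the
    summed inputs `a` of all units in the layer, not over a mini-batch."""
    n_units = len(a)
    mu = sum(a) / n_units
    var = sum((x - mu) ** 2 for x in a) / n_units
    sigma = math.sqrt(var + eps)  # eps is an assumed numerical stabiliser
    # Per-unit gain and bias are applied after normalization,
    # before the non-linearity.
    return [g * (x - mu) / sigma + b for x, g, b in zip(a, gain, bias)]


def batch_norm_train(batch, gain, bias, eps=1e-5):
    """Training-mode batch norm, for contrast: one mean/variance per unit,
    computed over the cases in the mini-batch."""
    n_cases, n_units = len(batch), len(batch[0])
    out = [[0.0] * n_units for _ in range(n_cases)]
    for i in range(n_units):
        column = [row[i] for row in batch]
        mu = sum(column) / n_cases
        var = sum((x - mu) ** 2 for x in column) / n_cases
        sigma = math.sqrt(var + eps)
        for n in range(n_cases):
            out[n][i] = gain[i] * (batch[n][i] - mu) / sigma + bias[i]
    return out


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step_ln(x_t, h_prev, W_xh, W_hh, gain, bias):
    """One recurrent step; the normalization statistics are recomputed
    from the current time step's summed inputs only."""
    a_t = [p + q for p, q in zip(matvec(W_hh, h_prev), matvec(W_xh, x_t))]
    return [math.tanh(v) for v in layer_norm(a_t, gain, bias)]


# Hypothetical toy values, chosen only for illustration.
gain, bias = [1.0] * 4, [0.0] * 4
case = [1.0, 2.0, 3.0, 4.0]

ln_out = layer_norm(case, gain, bias)  # depends on `case` alone

# The same case, normalized by batch norm, changes with its batch-mates.
bn_a = batch_norm_train([case, [0.0, 0.0, 0.0, 0.0]], gain, bias)[0]
bn_b = batch_norm_train([case, [10.0, -3.0, 7.0, 0.5]], gain, bias)[0]

# Layer-normalized RNN run over a sequence of arbitrary length.
W_xh = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2], [0.0, 0.6]]      # 4 x 2
W_hh = [[0.1 * (i - j) for j in range(4)] for i in range(4)]   # 4 x 4
h = [0.0] * 4
for x_t in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = rnn_step_ln(x_t, h, W_xh, W_hh, gain, bias)
```

Because `layer_norm` takes a single case, there is no separate inference mode and no running average to maintain. `batch_norm_train`, by contrast, gives different outputs for the same case depending on what else is in the mini-batch. The RNN loop needs no per-time-step stored statistics, so it runs unchanged on sequences of any length.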
