Abstract: Layer normalization (LN) is a ubiquitous technique in deep learning, but our theoretical understanding of it remains elusive. This paper investigates a new theoretical direction for LN: its nonlinearity and representation capacity. We investigate the representation capacity of a network built by layerwise composition of linear and LN transformations, referred to as LN-Net. We theoretically show that, given $m$ samples with any label assignment, an LN-Net with only 3 neurons in each layer and $O(m)$ LN layers can correctly classify them. We further show a lower bound on the VC dimension of an LN-Net. The nonlinearity of LN can be amplified by group partition, which is also theoretically demonstrated under mild assumptions and empirically supported by our experiments. Based on our analyses, we consider designing neural architectures that exploit and amplify the nonlinearity of LN, and their effectiveness is supported by our experiments.
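The LN-Net described in the abstract alternates linear maps with LN and uses no other activation. The following is a minimal sketch of that composition in plain Python, assuming LN without learnable scale or shift and a hypothetical 3x3 weight matrix `W` chosen only for illustration. It checks that the network is not additive, which shows only that LN is not a linear map; it is not the paper's own argument.

```python
import math

def layer_norm(x, eps=1e-5):
    """Standardize one sample across its neurons (no learnable scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weight, x):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def ln_net(weights, x):
    """Alternate linear maps and LN; no ReLU or other activation is used."""
    for weight in weights:
        x = layer_norm(linear(weight, x))
    return x

# Hypothetical weights and inputs, 3 neurons per layer as in the abstract.
W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0], [2.0, 0.0, 1.0]]
a, b = [1.0, 0.0, 0.0], [0.0, 3.0, 1.0]

# A linear map f would satisfy f(a + b) == f(a) + f(b).
out_sum = ln_net([W, W], [p + q for p, q in zip(a, b)])
sum_out = [p + q for p, q in zip(ln_net([W, W], a), ln_net([W, W], b))]
additivity_gap = max(abs(p - q) for p, q in zip(out_sum, sum_out))
print(additivity_gap)  # clearly nonzero, so the network is not linear
```

A purely linear stack of the same depth would collapse to a single matrix and give a gap of zero (up to rounding).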

What problem does this paper attempt to address?

The paper addresses the limited theoretical understanding of the nonlinearity and representational power of Layer Normalization (LN). Specifically:

1. **Proof of nonlinearity**: The paper proves that LN is a nonlinear transformation by defining and analyzing the "Sum of Squares Ratio" (SSR) and its Linear Invariant Lower Bound (LSSR). It does so by showing that a linear neural network containing LN can break the LSSR bound.
2. **Analysis of representational power**: The paper then examines the representational power of networks containing LN (referred to as LN-Net). The authors prove that an LN-Net with 3 neurons per layer and O(m) layers can correctly classify m samples with arbitrary label assignments. They also give a lower bound on the VC dimension of LN-Net.
3. **Amplification and utilization of nonlinearity**: The paper studies how to amplify the nonlinearity of LN through grouping (group-based LN, LN-G) and verifies the design experimentally. Dividing the neurons into multiple groups and performing LN independently within each group enhances the nonlinearity of the network, which improves its representational power and classification performance.

### Main contributions

- **Proof of nonlinearity**: The paper gives the first theoretical proof that LN is a nonlinear transformation.
- **Analysis of representational power**: It shows that LN-Net has strong representational power in theory and can correctly classify samples with arbitrary label assignments.
- **Amplification of nonlinearity**: It proposes amplifying the nonlinearity of LN through grouping (LN-G) and verifies the effectiveness of this approach experimentally.
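The grouping idea behind LN-G, as described above, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: the groups are assumed to be contiguous and of equal size, LN has no learnable scale or shift, and the names `group_layer_norm` and `num_groups` are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Standardize one sample across its neurons (no learnable scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def group_layer_norm(x, num_groups, eps=1e-5):
    """Split neurons into equal contiguous groups and apply LN within each.

    Assumption: contiguous, equally sized groups; other partitions are possible.
    """
    if len(x) % num_groups != 0:
        raise ValueError("number of neurons must be divisible by num_groups")
    size = len(x) // num_groups
    out = []
    for start in range(0, len(x), size):
        out.extend(layer_norm(x[start:start + size], eps))
    return out

x = [1.0, 2.0, 3.0, 10.0, 20.0, 60.0]
y_ln = layer_norm(x)                       # one mean/variance for all 6 neurons
y_lng = group_layer_norm(x, num_groups=2)  # separate statistics per group of 3
```

With `num_groups=1` the function reduces to ordinary LN. With more groups, each group is standardized by its own mean and variance, so a layer applies several independent normalizations instead of one.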
### Experimental results

- **Random label fitting experiment**: On the CIFAR-10-RL and MNIST-RL datasets, LN-Net performs significantly better than linear neural networks, which supports the nonlinearity of LN.
- **Grouped LN experiment**: With grouped LN (LN-G), the network perfectly classifies all random labels on CIFAR-10-RL and MNIST-RL, further supporting the nonlinearity amplification effect of LN-G.

### Conclusion

The paper establishes the nonlinear character of LN through theoretical analysis and experiments, and shows that grouped LN (LN-G) can effectively amplify this nonlinearity, improving the representational power and classification performance of neural networks. These results provide a theoretical basis and practical guidance for designing more efficient neural network architectures.

On the Nonlinearity of Layer Normalization
