Abstract: Approaches that prevent gradient explosion and vanishing have boosted the performance of deep neural networks in recent years. A unique one among them is the self-normalizing neural network (SNN), which is generally more stable than initialization techniques without explicit normalization. In previous studies, the self-normalization property of SNNs comes from the Scaled Exponential Linear Unit (SELU) activation function, which has achieved competitive accuracy on moderate-scale benchmarks. However, it has been shown that in deeper neural networks, SELU either leads to gradient explosion or loses its self-normalization property. Moreover, its accuracy on large-scale benchmarks such as ImageNet is less satisfactory. In this paper, we analyze the forward and backward passes of SNNs with mean-field theory and block dynamical isometry. We propose a new definition of the self-normalization property that is easier to use both analytically and numerically, together with a proposition that enables us to compare the strength of the self-normalization property between different activation functions. We further develop two new activation functions, leaky SELU (lSELU) and scaled SELU (sSELU), that have a stronger self-normalization property. Their optimal parameters can be easily solved with a constrained optimization program. In addition, analysis of the activation's mean in the forward pass reveals that the self-normalization property on the mean gets weaker with larger fan-in, which explains the performance degradation on ImageNet. This can be addressed with weight centralization, mixup data augmentation, and a centralized activation function. On the moderate-scale datasets CIFAR-10, CIFAR-100, and Tiny ImageNet, the direct application of lSELU and sSELU achieves up to 2.13% higher accuracy. On Conv MobileNet V1 with ImageNet, sSELU with Mixup, trainable λ, and a centralized activation function reaches 71.95% accuracy, which is even better than Batch Normalization. (Code in Supplementary Material.)
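As a rough illustration of the self-normalization property that the abstract builds on, the following sketch estimates the output mean and variance of the original SELU for zero-mean Gaussian pre-activations. It is a minimal sketch under stated assumptions, not the paper's analysis: the constants are the standard SELU values from the original SELU work (they are not given in the abstract), `output_moments` is a hypothetical helper name, and lSELU and sSELU are not implemented because the abstract does not define them.

```python
import math
import random

# Standard SELU constants (assumption: taken from the original SELU
# formulation, not from this abstract).
SELU_LAMBDA = 1.0507009873554805
SELU_ALPHA = 1.6732632423543772


def selu(x: float) -> float:
    """SELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise."""
    if x > 0:
        return SELU_LAMBDA * x
    return SELU_LAMBDA * SELU_ALPHA * math.expm1(x)


def output_moments(input_std: float, n: int = 200_000, seed: int = 0):
    """Monte Carlo estimate of the mean and variance of selu(z),
    where z ~ N(0, input_std^2). Hypothetical helper for illustration."""
    rng = random.Random(seed)
    ys = [selu(rng.gauss(0.0, input_std)) for _ in range(n)]
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / n
    return mean, var


if __name__ == "__main__":
    # Unit-variance input should map to roughly zero mean and unit variance.
    for std in (0.5, 1.0, 2.0):
        mean, var = output_moments(std)
        print(f"input std={std}: output mean={mean:.3f}, var={var:.3f}")
```

With unit-variance input, the estimated output mean should be close to 0 and the variance close to 1, which is the fixed point SELU was designed around. Trying other input scales gives a quick, informal way to see how the output statistics respond when the input drifts away from that fixed point.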

Normalized Activation Function: Toward Better Convergence

ANAct: Adaptive Normalization for Activation Functions

Activation Functions: Dive into an optimal activation function

Effect of Activation Functions on the Training of Overparametrized Neural Nets

Redefining The Self-Normalization Property

Evolving Normalization-Activation Layers

On the Importance of Normalisation Layers in Deep Learning with Piecewise Linear Activation Units

Activation function optimization method: Learnable series linear units (LSLUs)

A generic shift-norm-activation approach for deep learning

An Investigation of the Impact of Normalization Schemes on GCN Modelling

Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Activation function optimization scheme for image classification

Analysis of the rate of convergence of fully connected deep neural network regression estimates with smooth activation function

Unified Normalization for Accelerating and Stabilizing Transformers

Understanding and Improving Layer Normalization

Convergence of Deep Neural Networks with General Activation Functions and Pooling

A Non-monotonic Smooth Activation Function

Beyond BatchNorm: Towards a Unified Understanding of Normalization in Deep Learning

A Method on Searching Better Activation Functions

Adaptive Parametric Activation