Abstract: Approaches that prevent gradient explosion and vanishing have boosted the performance of deep neural networks in recent years. One distinctive approach is the self-normalizing neural network (SNN), which, without explicit normalization, is generally more stable than initialization techniques. In previous studies, the self-normalization property of SNN comes from the Scaled Exponential Linear Unit (SELU) activation function. However, it has been shown that in deeper neural networks, SELU either leads to gradient explosion or loses its self-normalization property. Moreover, its accuracy on large-scale benchmarks such as ImageNet is less satisfactory. In this paper, we analyze the forward and backward passes of SNN with mean-field theory and block dynamical isometry. We propose a new definition of the self-normalization property that is easier to use both analytically and numerically. We also give a proposition that enables us to compare the strength of the self-normalization property across different activation functions. We further develop two new activation functions, leaky SELU (lSELU) and scaled SELU (sSELU), that have a stronger self-normalization property. Their optimal parameters can be obtained easily by solving a constrained optimization program. In addition, an analysis of the activation's mean in the forward pass reveals that the self-normalization property on the mean gets weaker with larger fan-in, which explains the performance degradation on ImageNet. This can be addressed with weight centralization, mixup data augmentation, and a centralized activation function. On the moderate-scale datasets CIFAR-10, CIFAR-100, and Tiny ImageNet, the direct application of lSELU and sSELU achieves up to 2.13% higher accuracy.
With Conv MobileNet V1 on ImageNet, sSELU with Mixup, trainable λ, and a centralized activation function reaches 71.95% accuracy, which is even higher than that of Batch Normalization. (Code is in the Supplementary Material.)
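
As a rough illustration of the self-normalization property of SELU mentioned in the abstract, the sketch below checks the mean-field fixed point numerically. It is not the paper's analysis or code. It assumes the standard SELU constants, i.i.d. zero-mean weights with variance 1/fan_in, and the wide-network limit, so that each pre-activation is Gaussian with variance equal to the second moment of the previous layer's activations. The helper names `selu_moments` and `propagate_variance` are hypothetical.

```python
import math

# Standard SELU constants (assumed here; lSELU and sSELU are not modeled).
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805


def selu(x: float) -> float:
    """Scaled Exponential Linear Unit."""
    return LAMBDA * x if x > 0 else LAMBDA * ALPHA * math.expm1(x)


def selu_moments(q: float, half_width: float = 10.0, n: int = 4000):
    """Mean and second moment of selu(sqrt(q) * z) for z ~ N(0, 1).

    Integrates against the standard normal density with the trapezoidal
    rule on [-half_width, half_width] using n intervals.
    """
    h = 2.0 * half_width / n
    s = math.sqrt(q)
    mean = 0.0
    second = 0.0
    for i in range(n + 1):
        z = -half_width + i * h
        w = h * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        if i == 0 or i == n:
            w *= 0.5  # trapezoidal end-point weights
        y = selu(s * z)
        mean += w * y
        second += w * y * y
    return mean, second


def propagate_variance(q0: float, depth: int) -> float:
    """Iterate the mean-field map q -> E[selu(sqrt(q) z)^2] for `depth` layers.

    Assumes zero-mean weights with variance 1/fan_in, so the pre-activation
    variance of a layer equals the second moment of the previous activations.
    """
    q = q0
    for _ in range(depth):
        _, q = selu_moments(q)
    return q


if __name__ == "__main__":
    print(selu_moments(1.0))            # moments at unit input variance
    print(propagate_variance(2.0, 40))  # start above the fixed point
    print(propagate_variance(0.5, 40))  # start below the fixed point
```

Under these assumptions, the output mean is close to 0 and the second moment close to 1 at unit input variance, and the variance drifts back toward 1 from either side. The sketch does not capture the finite-depth and finite-fan-in effects that the abstract identifies as the source of SELU's problems.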

Stability and Convergence Theory for Learning ResNet: A Full Characterization

Stabilize deep ResNet with a sharp scaling factor τ

Forward Stability of ResNet and Its Variants

Normalized Activation Function: Toward Better Convergence

Convergence Analysis of Deep Residual Networks

Stable ResNet

Are deep ResNets provably better than linear predictors?

Demystifying ResNet

Training Over-parameterized Deep ResNet is Almost As Easy As Training a Two-layer Network

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Why ResNet Works? Residuals Generalize

How to Initialize your Network? Robust Initialization for WeightNorm & ResNets

A Convergence Theory Towards Practical Over-parameterized Deep Neural Networks

Residual Networks as Nonlinear Systems: Stability Analysis using Linearization

Towards Understanding the Importance of Shortcut Connections in Residual Networks

Learning Deep ResNet Blocks Sequentially using Boosting Theory

Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes

Gradient Descent Optimizes Normalization-Free ResNets

Global Convergence of Gradient Descent for Deep Linear Residual Networks

Redefining The Self-Normalization Property

Deeper Insights into Deep Graph Convolutional Networks: Stability and Generalization