Abstract: Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that with standard i.i.d. initializations, the only non-trivial dynamics is for $\alpha_L = \frac{1}{\sqrt{L}}$; other choices lead either to explosion or to identity mapping. This scaling factor corresponds in the continuous-time limit to a neural stochastic differential equation, contrary to a widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the latter regime, stability is obtained with specific correlated initializations and $\alpha_L = \frac{1}{L}$. Our analysis suggests a strong interplay between scaling and regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.

What problem does this paper attempt to address?

This paper studies the training stability of deep residual neural networks (ResNets) in the large-depth regime. ResNets perform very well on complex machine learning tasks, but as the depth $L$ increases, training must be carefully designed to avoid vanishing or exploding gradients. A widely discussed strategy is to scale the output of each layer by a factor $\alpha_L$, but there is no consensus on how $\alpha_L$ should be chosen. Using a probabilistic analysis, the authors show that, with standard independent and identically distributed (i.i.d.) initializations, $\alpha_L = 1/\sqrt{L}$ is the only choice that yields non-trivial dynamics at initialization; other choices lead either to explosion or to a network that is almost the identity mapping. In the continuous-time limit, this scaling corresponds to a neural stochastic differential equation (neural SDE), rather than to the neural ordinary differential equation (neural ODE) that deep ResNets are commonly interpreted as discretizing. The analysis also suggests a strong interplay between the scaling factor and the regularity (smoothness) of the weights as a function of the layer index.

The contributions of the paper include:

1. A mathematical analysis of the behavior of ResNets at initialization as a function of the choice of $\alpha_L$.
2. An explanation of the connection between $\alpha_L = 1/\sqrt{L}$ and neural SDEs, whereas the neural ODE regime is stable with specific correlated initializations and $\alpha_L = 1/L$.
3. Experiments exhibiting a continuous range of regimes driven by the scaling factor and the regularity of the weights with respect to the layer index, which jointly affect performance before and after training.

The related work section notes that, despite numerous discussions of how to scale ResNets, no clear consensus has emerged. The paper aims to deepen the theoretical understanding of deep ResNets.
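
The three regimes described above can be observed numerically on a toy model. The sketch below is not the authors' code; it assumes a simplified residual update $h_{k+1} = h_k + \alpha_L V_{k+1}\,\mathrm{ReLU}(h_k)$ with $\alpha_L = L^{-\beta}$ and i.i.d. Gaussian entries of variance 1/width for each $V_k$. The function name `relative_change`, the width, and the choice of ReLU are illustrative assumptions. It reports how far the output moves from the input at initialization, relative to the input norm.

```python
import numpy as np


def relative_change(depth, beta, width=64, seed=0):
    """Return ||h_L - h_0|| / ||h_0|| for a toy ResNet at initialization.

    Assumed toy model (not taken from the paper's code):
        h_{k+1} = h_k + alpha_L * V_{k+1} @ relu(h_k),  alpha_L = depth ** (-beta),
    where each V_k has i.i.d. N(0, 1/width) entries.
    """
    rng = np.random.default_rng(seed)
    alpha = depth ** (-beta)
    h0 = rng.standard_normal(width)  # random input vector
    h = h0.copy()
    for _ in range(depth):
        # Fresh, independent weights at every layer (i.i.d. initialization).
        V = rng.standard_normal((width, width)) / np.sqrt(width)
        h = h + alpha * V @ np.maximum(h, 0.0)
    return float(np.linalg.norm(h - h0) / np.linalg.norm(h0))


if __name__ == "__main__":
    depth = 1000
    for beta in (0.25, 0.5, 1.0):
        ratio = relative_change(depth, beta)
        print(f"beta={beta:<4} alpha_L=L^-{beta}: relative change = {ratio:.3g}")
```

Under these assumptions, the relative change should be large for $\beta < 1/2$ (explosion), of order one for $\beta = 1/2$, and close to zero for $\beta = 1$ (near-identity mapping), which matches the qualitative picture in the summary. The correlated initializations that make $\alpha_L = 1/L$ non-trivial are not modeled here.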

Scaling ResNets in the Large-depth Regime

Improve Generalization Ability of Deep Wide Residual Network with A Suitable Scaling Factor

A Dynamical Model of Neural Scaling Laws

Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit

Generalization of Scaled Deep ResNets in the Mean-Field Regime

Explaining Neural Scaling Laws

Graph Expansions of Deep Neural Networks and their Universal Scaling Limits

Scaling description of generalization with number of parameters in deep learning

Super Consistency of Neural Network Landscapes and Learning Rate Transfer

Stabilize deep ResNet with a sharp scaling factor τ

Neural Scaling Laws of Deep ReLU and Deep Operator Network: A Theoretical Study

Unified Neural Network Scaling Laws and Scale-time Equivalence

Doubly infinite residual neural networks: a diffusion process approach

Scaling Laws Beyond Backpropagation

On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks

Universal Scaling Laws of Absorbing Phase Transitions in Artificial Deep Neural Networks

The Optimization Landscape of SGD Across the Feature Learning Strength

Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected ReLU Networks on Initialization

Batch Normalization Provably Avoids Rank Collapse for Randomly Initialised Deep Networks

Disentangling feature and lazy training in deep neural networks

NeuralScale: Efficient Scaling of Neurons for Resource-Constrained Deep Neural Networks