Scaling ResNets in the Large-depth Regime

Pierre Marion, Adeline Fermanian, Gérard Biau, Jean-Philippe Vert
2024-06-10
Abstract: Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that with standard i.i.d. initializations, the only non-trivial dynamics is for $\alpha_L = \frac{1}{\sqrt{L}}$; other choices lead either to explosion or to identity mapping. This scaling factor corresponds in the continuous-time limit to a neural stochastic differential equation, contrary to a widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the latter regime, stability is obtained with specific correlated initializations and $\alpha_L = \frac{1}{L}$. Our analysis suggests a strong interplay between scaling and regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the training stability of deep residual neural networks (ResNets) in the large-depth regime. ResNets achieve state-of-the-art results on complex machine learning tasks, but as the depth $L$ increases, training must be set up carefully to avoid vanishing or exploding gradients. A widely discussed strategy is to scale the output of each layer by a factor $\alpha_L$, yet there is no consensus on how $\alpha_L$ should depend on $L$. Through a probabilistic analysis, the authors show that with standard independent and identically distributed (i.i.d.) weight initializations, $\alpha_L = 1/\sqrt{L}$ is the only choice that gives non-trivial dynamics at initialization; other choices lead either to explosion or to a network that is close to the identity mapping. In the continuous-time limit, this scaling corresponds to a neural stochastic differential equation (neural SDE), rather than to the neural ordinary differential equation (neural ODE) that deep ResNets are commonly interpreted as discretizing. In the neural ODE regime, stability is instead obtained with $\alpha_L = 1/L$ together with specific correlated initializations. The analysis further suggests a strong interplay between the scaling and the regularity (smoothness) of the weights as a function of the layer index.
The contributions of the paper include:
1. A mathematical analysis of the behavior of ResNets at initialization as a function of the choice of $\alpha_L$.
2. An explanation of the connection between $\alpha_L = 1/\sqrt{L}$ and neural SDEs, and between $\alpha_L = 1/L$ with specific correlated initializations and neural ODEs.
3. Experiments exhibiting a continuous range of regimes driven by the scaling $\alpha_L$ and the regularity of the weights as a function of the layer index, which jointly affect performance before and after training.
The related work section notes that, despite numerous discussions of how to scale ResNets, no clear consensus has emerged. The paper aims to clarify this question through theoretical analysis.
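As a rough illustration of the initialization behavior described above, the following Python sketch simulates a toy residual update $h_{k+1} = h_k + \alpha_L V_{k+1}\,\mathrm{ReLU}(h_k)$ with $\alpha_L = L^{-\beta}$ and i.i.d. Gaussian weights, and reports how far the output moves from the input. The specific update rule, the width, the $1/\sqrt{d}$ weight scaling, and the function name `relative_change` are assumptions made for this sketch; they are not taken from the paper and may differ from its exact setup.

```python
import numpy as np


def relative_change(depth, beta, width=64, seed=0):
    """Return ||h_L - h_0|| / ||h_0|| for a toy ResNet at initialization.

    Assumed toy model (not the paper's exact architecture):
        h_{k+1} = h_k + alpha_L * V_{k+1} @ relu(h_k),
    with alpha_L = depth ** (-beta) and V_{k+1} having i.i.d. N(0, 1/width)
    entries, drawn independently for every layer.
    """
    rng = np.random.default_rng(seed)
    alpha = depth ** (-beta)
    h0 = rng.standard_normal(width)
    h = h0.copy()
    for _ in range(depth):
        # Fresh i.i.d. weights per layer (the "standard" initialization).
        V = rng.standard_normal((width, width)) / np.sqrt(width)
        h = h + alpha * (V @ np.maximum(h, 0.0))
    return np.linalg.norm(h - h0) / np.linalg.norm(h0)


if __name__ == "__main__":
    depths = (10, 100, 1000)
    for beta in (0.25, 0.5, 1.0):
        changes = [relative_change(L, beta) for L in depths]
        print(f"beta={beta}: " + ", ".join(f"L={L}: {c:.3g}" for L, c in zip(depths, changes)))
```

Under these assumptions, one would expect the relative change to grow quickly with depth for $\beta = 1/4$, to stay of order one for $\beta = 1/2$, and to shrink toward zero (output close to the input) for $\beta = 1$, which matches the three regimes described in the summary. This is a numerical sanity check on a toy model, not a substitute for the paper's analysis.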