Gradient Descent Optimizes Normalization-Free ResNets.

Zongpeng Zhang, Zenan Ling, Tong Lin, Zhouchen Lin
DOI: https://doi.org/10.1109/ijcnn54540.2023.10191204
2023-01-01
Abstract: Recent empirical studies observe that a deep residual network can be trained reliably even without normalization. We call such structures normalization-free Residual Networks (N-F ResNets); in place of normalization, they add a learnable parameter $\alpha$ that controls the scale of the residual block. However, despite their empirical success, the theoretical understanding of N-F ResNets is still limited. In this paper, we provide the first theoretical understanding of N-F ResNets from two perspectives. First, we prove that the gradient descent (GD) algorithm can find the global minimum of the training loss at a linear rate for over-parameterized N-F ResNets. Second, we prove that N-F ResNets can avoid the gradient exploding or vanishing problem when the key parameter $\alpha$ is initialized to a small constant. Notably, we demonstrate that the gradients of N-F ResNets are more stable than those of ResNets with Kaiming initialization. Moreover, experiments on benchmark datasets verify our theoretical results.
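
To make the role of $\alpha$ concrete, here is a minimal pure-Python sketch of a scaled residual block. The abstract does not give the exact block definition, so the form x + alpha * relu(W x), the Kaiming-style weight scale sqrt(2 / width), and the helper names (nf_residual_block, forward_norm) are all assumptions for illustration, not the authors' construction. The sketch only tracks how the activation norm changes with depth in a forward pass; it does not reproduce the paper's gradient analysis.

```python
import math
import random


def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def nf_residual_block(x, W, alpha):
    """Assumed block form: x + alpha * relu(W x).

    alpha scales the residual branch and takes the place of a
    normalization layer. In a real network it would be learnable;
    here it is a fixed float.
    """
    branch = relu(matvec(W, x))
    return [xi + alpha * bi for xi, bi in zip(x, branch)]


def forward_norm(depth, width, alpha, seed=0):
    """Return ||x_depth|| / ||x_0|| after stacking `depth` blocks."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    x0_norm = math.sqrt(sum(v * v for v in x))
    std = math.sqrt(2.0 / width)  # Kaiming-style scale (assumption)
    for _ in range(depth):
        W = [[rng.gauss(0.0, std) for _ in range(width)]
             for _ in range(width)]
        x = nf_residual_block(x, W, alpha)
    return math.sqrt(sum(v * v for v in x)) / x0_norm


# Compare a small alpha against an unscaled residual branch (alpha = 1).
ratio_small = forward_norm(depth=20, width=16, alpha=0.1)
ratio_unit = forward_norm(depth=20, width=16, alpha=1.0)
print("alpha=0.1:", ratio_small)
print("alpha=1.0:", ratio_unit)
```

Under these assumptions, the activation norm grows far more slowly with depth for the small $\alpha$ than for $\alpha = 1$, which is the intuition behind initializing $\alpha$ to a small constant. The value 0.1 is an arbitrary choice for the demonstration.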