Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected ReLU Networks on Initialization

Cameron Jakub, Mihai Nica
2024-08-15
Abstract: Despite remarkable performance on a variety of tasks, many properties of deep neural networks are not yet theoretically understood. One such mystery is the depth degeneracy phenomenon: the deeper you make your network, the closer your network is to a constant function on initialization. In this paper, we examine the evolution of the angle between two inputs to a ReLU neural network as a function of the number of layers. By using combinatorial expansions, we find precise formulas for how fast this angle goes to zero as depth increases. These formulas capture microscopic fluctuations that are not visible in the popular framework of infinite width limits, and lead to qualitatively different predictions. We validate our theoretical results with Monte Carlo experiments and show that our results accurately approximate finite network behaviour. We also empirically investigate how the depth degeneracy phenomenon can negatively impact training of real networks. The formulas are given in terms of the mixed moments of correlated Gaussians passed through the ReLU function. We also find a surprising combinatorial connection between these mixed moments and the Bessel numbers that allows us to explicitly evaluate these moments.
Machine Learning, Probability
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper explores a phenomenon in deep neural networks (DNNs) known as **depth degeneracy**. As the number of layers increases, the network at initialization approaches a constant function and can no longer distinguish different inputs. This makes it difficult for deep networks to learn effectively in the early stages of training.

To study this problem, the authors track how the angle between two input vectors changes with depth in a fully connected neural network with the ReLU activation function. Using combinatorial expansions, they derive precise formulas showing that the angle approaches zero rapidly as depth increases, at an exponential rate. They also validate the theoretical results with Monte Carlo experiments and show how the phenomenon affects the training performance of real networks.

### Main contributions

1. **Theoretical analysis and formula derivation**: The authors derive precise formulas for how the angle changes with the number of layers and reveal the influence of microscopic fluctuations that are invisible in the infinite-width limit framework.
2. **Empirical verification**: The theoretical results are validated with Monte Carlo experiments, which show that the formulas accurately approximate the behaviour of finite-width networks.
3. **Practical application**: The authors explore the negative impact of depth degeneracy on network training and propose a method, based on predicting the initial angle, for screening out network architectures that are likely to perform poorly, thereby improving neural architecture search.

### Formula presentation

1. **Angle evolution formula**:

\[ \ln \sin^2(\theta_{\ell+1}) \approx \ln \sin^2(\theta_\ell) - \frac{2}{3\pi}\theta_\ell - \rho(n_\ell) \]

where \(\theta_\ell\) is the angle between the two inputs at layer \(\ell\), \(n_\ell\) is the width of layer \(\ell\), and

\[ \rho(n) := \ln\left(\frac{n+5}{n-1}\right) - \frac{10n}{(n+5)^2} + \frac{6n}{(n-1)^2} = \frac{2}{n} + O\left(\frac{1}{n^2}\right) \]

2. **Mean and variance formulas**:

\[ E[\ln \sin^2(\theta_{\ell+1})] = \mu(\theta_\ell, n_\ell) + O\left(\frac{1}{n_\ell^2}\right) \]

\[ \text{Var}[\ln \sin^2(\theta_{\ell+1})] = \sigma^2(\theta_\ell, n_\ell) + O\left(\frac{1}{n_\ell^2}\right) \]

where

\[ \mu(\theta, n) = \ln \sin^2\theta - \frac{2}{3\pi}\theta - \rho(n) - \frac{8\theta}{15\pi n} - \left(\frac{2}{9\pi^2} - \frac{68}{45\pi^2 n}\right)\theta^2 + O(\theta^3) \]

\[ \sigma^2(\theta, n) = \frac{8}{n} - \frac{64}{15\pi}\frac{\theta}{n} - \left(8 + \frac{296}{45\pi}\right)\frac{\theta^2}{n} + O(\theta^3) \]

### Conclusion

By analyzing how the angle evolves with depth, this paper characterizes the depth degeneracy phenomenon and provides mathematical formulas that describe it. This helps explain the initialization behavior of deep neural networks and offers new perspectives and tools for network architecture design.
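As a rough illustration of the angle evolution formula, the recursion can be iterated numerically to see how quickly the angle shrinks with depth. The sketch below is not the authors' code: the function names (`rho`, `next_angle`, `angle_trajectory`) are hypothetical, the approximate update is treated as an exact deterministic rule (ignoring the fluctuations described by the variance formula), and the angle is recovered from \(\sin^2\) on the branch \([0, \pi/2]\), which is an assumption made here for simplicity.

```python
import math


def rho(n: int) -> float:
    """Width-dependent correction rho(n) as written in the summary (needs n >= 2)."""
    return (
        math.log((n + 5) / (n - 1))
        - 10 * n / (n + 5) ** 2
        + 6 * n / (n - 1) ** 2
    )


def next_angle(theta: float, n: int) -> float:
    """Apply one layer of the approximate update to an angle in (0, pi/2]."""
    log_sin2 = math.log(math.sin(theta) ** 2) - 2 * theta / (3 * math.pi) - rho(n)
    # Assumption: invert sin^2 on the branch [0, pi/2]; clamp guards rounding.
    sin_next = math.sqrt(min(1.0, math.exp(log_sin2)))
    return math.asin(sin_next)


def angle_trajectory(theta0: float, widths: list[int]) -> list[float]:
    """Return [theta_0, theta_1, ...] for a network with the given layer widths."""
    thetas = [theta0]
    for n in widths:
        thetas.append(next_angle(thetas[-1], n))
    return thetas


if __name__ == "__main__":
    # Hypothetical setup: initial angle of 1 radian, ten layers of width 100.
    for layer, theta in enumerate(angle_trajectory(1.0, [100] * 10)):
        print(f"layer {layer:2d}: theta = {theta:.4f}")
```

Because \(\rho(n) \approx 2/n\), narrower layers subtract more from \(\ln \sin^2(\theta_\ell)\) at each step, so under this sketch the angle collapses faster in narrow networks than in wide ones of the same depth.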