Abstract: Despite remarkable performance on a variety of tasks, many properties of deep neural networks are not yet theoretically understood. One such mystery is the depth degeneracy phenomenon: the deeper you make your network, the closer your network is to a constant function on initialization. In this paper, we examine the evolution of the angle between two inputs to a ReLU neural network as a function of the number of layers. By using combinatorial expansions, we find precise formulas for how fast this angle goes to zero as depth increases. These formulas capture microscopic fluctuations that are not visible in the popular framework of infinite width limits, and lead to qualitatively different predictions. We validate our theoretical results with Monte Carlo experiments and show that our results accurately approximate finite network behaviour. We also empirically investigate how the depth degeneracy phenomenon can negatively impact training of real networks. The formulas are given in terms of the mixed moments of correlated Gaussians passed through the ReLU function. We also find a surprising combinatorial connection between these mixed moments and the Bessel numbers that allows us to explicitly evaluate these moments.
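The angle experiment described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the paper's experimental code: the function names `angle_between` and `simulate_angles`, the He-style weight variance `2 / width`, the absence of biases, the equal layer widths, and measuring the angle between post-activation vectors are all assumptions made here for illustration.

```python
import numpy as np

def angle_between(x, y):
    """Angle in radians between two nonzero vectors."""
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def simulate_angles(theta0, width, depth, rng):
    """Push two inputs at angle theta0 through one random ReLU network.

    Returns the list of angles: the input angle followed by the angle
    after each of the `depth` layers.
    """
    # Two unit vectors in R^width separated by the angle theta0.
    x = np.zeros(width)
    y = np.zeros(width)
    x[0] = 1.0
    y[0], y[1] = np.cos(theta0), np.sin(theta0)

    angles = [angle_between(x, y)]
    for _ in range(depth):
        # Assumed initialization: i.i.d. Gaussian weights, variance 2/width.
        W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
        # Both inputs pass through the same weights, then ReLU.
        x = np.maximum(W @ x, 0.0)
        y = np.maximum(W @ y, 0.0)
        angles.append(angle_between(x, y))
    return angles

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for layer, theta in enumerate(simulate_angles(np.pi / 2, 256, 30, rng)):
        print(f"layer {layer:2d}: angle = {theta:.4f} rad")
```

A single run gives one random trajectory; averaging over many independent networks would be needed to estimate the mean and fluctuations that the paper's formulas describe.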

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper explores a phenomenon in deep neural networks (DNNs) called **depth degeneracy**: as the number of layers increases, the network at initialization approaches a constant function and can no longer distinguish different inputs. This makes it difficult for deep networks to learn effectively in the early stages of training.

To study this problem, the paper tracks how the angle between two input vectors changes with depth in a fully connected neural network with the ReLU activation function. Through combinatorial expansions and exact formulas, it shows that the angle approaches zero exponentially fast as depth increases. The theoretical results are validated with Monte Carlo experiments, and the paper also shows how they relate to the training performance of real networks.

### Main contributions

1. **Theoretical analysis and formula derivation**: The paper derives an exact formula for how the angle changes with the number of layers and reveals the influence of microscopic fluctuations that are invisible in the infinite-width limit framework.
2. **Empirical verification**: The theoretical results are verified through Monte Carlo experiments, which demonstrate their accuracy for finite-width networks.
3. **Practical application**: The paper explores the negative impact of depth degeneracy on network training and proposes a method, based on predicting the angle at initialization, for screening out network architectures that may perform poorly, thereby improving neural architecture search.

### Formula presentation

1. **Angle evolution formula**:

\[ \ln \sin^2(\theta_{\ell + 1}) \approx \ln \sin^2(\theta_\ell)-\frac{2}{3\pi}\theta_\ell-\rho(n_\ell) \]

where \(\theta_\ell\) is the angle between the two inputs at layer \(\ell\), \(n_\ell\) is the width of that layer, and

\[ \rho(n):=\ln\left(\frac{n + 5}{n-1}\right)-\frac{10n}{(n + 5)^2}+\frac{6n}{(n-1)^2}=\frac{2}{n}+O\left(\frac{1}{n^2}\right) \]

2. **Mean and variance formulas**:

\[ E[\ln \sin^2(\theta_{\ell+1})]=\mu(\theta_\ell,n_\ell)+O\left(\frac{1}{n_\ell^2}\right) \]

\[ \text{Var}[\ln \sin^2(\theta_{\ell+1})]=\sigma^2(\theta_\ell,n_\ell)+O\left(\frac{1}{n_\ell^2}\right) \]

where

\[ \mu(\theta,n)=\ln \sin^2\theta-\frac{2}{3\pi}\theta-\rho(n)-\frac{8\theta}{15\pi n}-\left(\frac{2}{9\pi^2}-\frac{68}{45\pi^2 n}\right)\theta^2+O(\theta^3) \]

\[ \sigma^2(\theta,n)=\frac{8}{n}-\frac{64}{15\pi}\frac{\theta}{n}-\left(8+\frac{296}{45\pi}\right)\frac{\theta^2}{n}+O(\theta^3) \]

### Conclusion

Through an in-depth analysis of angle evolution, this paper reveals the essence of the depth degeneracy phenomenon and proposes effective mathematical models to describe it. This not only helps to understand the initialization behavior of deep neural networks but also provides new perspectives and tools for optimizing network architecture design.
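The angle evolution formula from the formula presentation can be iterated numerically to see how quickly the predicted angle shrinks. The following is a minimal sketch under stated assumptions: the helper names `rho` and `predict_angles` are hypothetical, only the leading-order recursion is used (the fluctuation terms in the mean and variance formulas are ignored), and the angle is recovered from \(\ln \sin^2\theta\) on the branch \(0 < \theta \le \pi/2\). Since the formula is an approximation, the output should be read as indicative rather than exact, particularly for large starting angles.

```python
import math

def rho(n):
    """Finite-width correction rho(n) = 2/n + O(1/n^2); requires n > 1."""
    return (math.log((n + 5) / (n - 1))
            - 10 * n / (n + 5) ** 2
            + 6 * n / (n - 1) ** 2)

def predict_angles(theta0, widths):
    """Iterate ln sin^2(theta_{l+1}) = ln sin^2(theta_l) - 2*theta_l/(3*pi) - rho(n_l).

    theta0 must lie in (0, pi/2]; widths lists the layer widths n_l.
    Returns theta0 followed by the predicted angle after each layer.
    """
    thetas = [theta0]
    theta = theta0
    for n in widths:
        log_sin2 = (math.log(math.sin(theta) ** 2)
                    - 2.0 * theta / (3.0 * math.pi)
                    - rho(n))
        # Invert ln sin^2 on the branch 0 < theta <= pi/2.
        theta = math.asin(math.exp(0.5 * log_sin2))
        thetas.append(theta)
    return thetas

if __name__ == "__main__":
    print(f"rho(100) = {rho(100):.5f}, 2/100 = {2 / 100:.5f}")
    for layer, theta in enumerate(predict_angles(1.0, [100] * 20)):
        print(f"layer {layer:2d}: predicted angle = {theta:.4f} rad")
```

Because both subtracted terms are positive for \(n > 1\), the predicted angle decreases at every layer, and narrower layers (larger \(\rho(n)\)) speed up the decay.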

Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected ReLU Networks on Initialization

Network Degeneracy as an Indicator of Training Performance: Comparing Finite and Infinite Width Angle Predictions

Neural networks with ReLU powers need less depth

Towards Lower Bounds on the Depth of ReLU Neural Networks

Quasi-Equivalence of Width and Depth of Neural Networks

A Deep Conditioning Treatment of Neural Networks

Initialization Matters: Privacy-Utility Analysis of Overparameterized Neural Networks

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Implicit Hypersurface Approximation Capacity in Deep ReLU Networks

Feature Learning and Generalization in Deep Networks with Orthogonal Weights

Compelling ReLU Networks to Exhibit Exponentially Many Linear Regions at Initialization and During Training

Bidirectionally Self-Normalizing Neural Networks

Critical Initialization of Wide and Deep Neural Networks through Partial Jacobians: General Theory and Applications

Topological obstruction to the training of shallow ReLU neural networks

Scaling ResNets in the Large-depth Regime

Luck Matters: Understanding Training Dynamics of Deep ReLU Networks

On the Depth of Deep Neural Networks: A Theoretical View

Depth Separation in Norm-Bounded Infinite-Width Neural Networks

Principles for Initialization and Architecture Selection in Graph Neural Networks with ReLU Activations

On Minimal Depth in Neural Networks

Three Quantization Regimes for ReLU Networks