Abstract: We prove a large deviation principle for deep neural networks with Gaussian weights and (at most linearly growing) activation functions. This generalises earlier work, in which bounded and continuous activation functions were considered. In practice, linearly growing activation functions such as ReLU are most commonly used. We furthermore simplify previous expressions for the rate function and give power-series expansions for the ReLU case.
What problem does this paper attempt to address?
This paper investigates large deviations of Gaussian neural networks with ReLU activation functions. The author, Quirin Vogel, proves a large deviation principle for deep neural networks with Gaussian weights, in the regime of a large number of parameters, whose activation functions grow at most linearly, such as ReLU. The paper extends previous work, which considered only bounded and continuous activation functions, and simplifies the earlier expressions for the rate function.
The paper first defines the model: a neural network consisting of multiple layers with weights and biases, and an activation function of at most linear growth, such as ReLU. The author assumes that the weights and biases follow Gaussian distributions and introduces a method for passing from a "conditional large deviation principle" to a "global large deviation principle." The author points out that ReLU and other common activation functions satisfy these assumptions.
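The model described above can be illustrated with a minimal sketch of a fully connected network at random initialization. The function name, the variance parameters `c_w` and `c_b`, and the scaling of the weight variance by the input width of each layer are assumptions chosen for illustration (a common convention for Gaussian networks), not necessarily the normalisation used in the paper.

```python
import numpy as np


def relu(x):
    """ReLU activation: grows at most linearly in |x|."""
    return np.maximum(x, 0.0)


def sample_gaussian_network_output(x, widths, out_dim=1, c_w=1.0, c_b=1.0, rng=None):
    """Draw one random network and evaluate it at the input x.

    Assumed (hypothetical) parametrisation: weights ~ N(0, c_w / fan_in),
    biases ~ N(0, c_b), all independent. No training is involved.
    """
    rng = np.random.default_rng() if rng is None else rng
    h = np.asarray(x, dtype=float)
    fan_in = h.shape[0]
    for width in widths:
        # Hidden layer: Gaussian affine map followed by ReLU.
        W = rng.normal(0.0, np.sqrt(c_w / fan_in), size=(width, fan_in))
        b = rng.normal(0.0, np.sqrt(c_b), size=width)
        h = relu(W @ h + b)
        fan_in = width
    # Output layer: Gaussian affine map without activation.
    W = rng.normal(0.0, np.sqrt(c_w / fan_in), size=(out_dim, fan_in))
    b = rng.normal(0.0, np.sqrt(c_b), size=out_dim)
    return W @ h + b
```

Calling `sample_gaussian_network_output([1.0, -2.0], widths=[64, 64], out_dim=3)` repeatedly gives independent samples of the random output vector whose distribution the large deviation principle describes as the widths grow.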
The main contributions of the paper include:
1. Proving that, under ReLU activation, the random output vector of the neural network satisfies a large deviation principle, which quantifies the probability of atypical outputs even in high-dimensional cases.
2. Providing a simplified form of the rate function, which reduces the complexity of the associated optimization problem, and giving a power-series expansion for the ReLU activation function, which helps with approximate calculations in high dimensions.
3. Discussing the key differences between activation functions of linear growth and those of super-linear or sub-linear growth: for faster-growing functions, the large deviation behavior is no longer of the exponential type.
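For reference, the statement that a sequence of random vectors satisfies a large deviation principle has the following standard general form. The symbols $X_n$, $v_n$ and $I$ are generic placeholders rather than the paper's notation; the specific speed and rate function for Gaussian networks are given in the paper itself.

```latex
% Generic definition: (X_n) satisfies a large deviation principle with
% speed v_n \to \infty and rate function I : \mathbb{R}^d \to [0, \infty]
% (lower semicontinuous) if, for every closed set F and every open set G,
\limsup_{n \to \infty} \frac{1}{v_n} \log \mathbb{P}(X_n \in F)
  \le -\inf_{x \in F} I(x),
\qquad
\liminf_{n \to \infty} \frac{1}{v_n} \log \mathbb{P}(X_n \in G)
  \ge -\inf_{x \in G} I(x).
```

Informally, $\mathbb{P}(X_n \approx x)$ decays like $\exp(-v_n I(x))$, so the rate function measures how unlikely each atypical outcome is on the exponential scale.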
The paper also discusses the relevant theoretical tools, such as Gaussian processes and convex analysis, which are used to address the continuity issues of non-trivial gradients and of the conditional large deviation principles. In addition, the author points out that these results apply to the neural network before training, i.e., with randomly initialized weights and biases.
In summary, by studying large deviations of Gaussian neural networks with ReLU activation, this paper provides a theoretical basis for understanding the statistical behavior of deep learning models with a large number of parameters at initialization. Such results may in turn help inform the understanding of the training process of deep learning models.