Abstract: We consider functions from the real numbers to the real numbers, output by a neural network with 1 hidden activation layer, arbitrary width, and ReLU activation function. We assume that the parameters of the neural network are chosen uniformly at random with respect to various probability distributions, and compute the expected distribution of the points of non-linearity. We use these results to explain why the network may be biased towards outputting functions with simpler geometry, and why certain functions with low information-theoretic complexity are nonetheless hard for a neural network to approximate.
What problem does this paper attempt to address?
The paper studies how the points of non-linearity are distributed in functions generated by random neural networks, and uses this to explain why certain functions with low information-theoretic complexity are still hard for neural networks to approximate.
Specifically, the paper considers neural networks with a single hidden layer and ReLU activation, whose parameters (weights and biases) are drawn at random. Such a network outputs a piecewise linear function, and the study analyzes the number and distribution of its points of non-linearity, i.e., the points where the slope of the function changes. Through theoretical analysis, the authors find that this distribution depends on the choice of parameter space.
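As a sketch of the setting, the function computed by such a network can be written as below. The symbols $n$, $v_i$, $w_i$, $b_i$, and $c$ are notation assumed here for illustration and may not match the paper's own.

```latex
f(x) = c + \sum_{i=1}^{n} v_i \,\max(0,\; w_i x + b_i),
\qquad
x_i = -\frac{b_i}{w_i} \quad (w_i \neq 0).
```

Each hidden unit is linear on either side of $x_i$, so $f$ can change slope only at the points $x_1, \dots, x_n$. A network of width $n$ therefore has at most $n$ points of non-linearity, and where they fall is determined by how the pairs $(w_i, b_i)$ are sampled.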
The main contributions of the paper include:
1. **Rectangular Parameter Space**: When weights and biases are drawn uniformly at random from a finite interval, the number of points of non-linearity follows a binomial distribution. The paper also derives the expected number of such points in several cases.
2. **Gaussian Parameter Space**: When weights and biases are drawn from a normal distribution, the probability distribution of the number of points of non-linearity is similar to that in the rectangular parameter space, but the coefficients in the probability formula differ.
3. **Spherical Parameter Space**: Here biases are still drawn uniformly at random from a finite interval, while weights are drawn uniformly at random from within a sphere. For finite domains, the paper gives an expression for the expected number of points of non-linearity and discusses its asymptotic behavior.
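The rectangular case can be checked empirically with a short simulation. The following is a minimal sketch, not the paper's method: it assumes weights and biases are both uniform on [-1, 1] and counts kinks in the domain [-T, T], with the names `count_breakpoints`, `simulate`, and `half_width` chosen here for illustration. It also assumes every output weight is non-zero, so each hidden-unit kink is a genuine point of non-linearity. For this particular box, an elementary area computation gives P(|b/w| <= T) = T/2 when 0 < T <= 1, so the count should be close to Binomial(n, T/2).

```python
import numpy as np

def count_breakpoints(w, b, lo, hi):
    """Count hidden units whose ReLU kink x = -b/w falls inside [lo, hi]."""
    nonzero = w != 0.0                 # a unit with w == 0 is constant in x: no kink
    kinks = -b[nonzero] / w[nonzero]
    return int(np.sum((kinks >= lo) & (kinks <= hi)))

def simulate(width, trials, half_width, seed=0):
    """Sample random networks and return the breakpoint count of each one."""
    rng = np.random.default_rng(seed)
    counts = np.empty(trials, dtype=int)
    for t in range(trials):
        w = rng.uniform(-1.0, 1.0, size=width)  # assumed weight range
        b = rng.uniform(-1.0, 1.0, size=width)  # assumed bias range
        counts[t] = count_breakpoints(w, b, -half_width, half_width)
    return counts

width, trials, half_width = 50, 4000, 0.5
counts = simulate(width, trials, half_width)

# Success probability for this assumed box: P(|b/w| <= T) = T / 2 for 0 < T <= 1.
p = half_width / 2.0
print("empirical mean:", counts.mean(), "| binomial mean n*p:", width * p)
print("empirical var: ", counts.var(), "| binomial var n*p*(1-p):", width * p * (1 - p))
```

Because the units are sampled independently, each kink lands in the domain with the same probability, which is what makes the count binomial. Swapping `rng.uniform` for `rng.normal` gives a quick way to explore the Gaussian case, where the success probability changes.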
Through these results, the paper further explains why some functions, despite having low information-theoretic complexity, may still be difficult for neural networks to approximate. For example, a periodic sawtooth wave function, although low in information-theoretic complexity, is relatively difficult for neural networks using ReLU activation functions to learn.
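One way to see the cost of periodic structure is to build such a function by hand. The sketch below is an illustration under stated assumptions rather than the paper's construction: it uses a continuous triangle wave as a stand-in for the sawtooth (a true sawtooth is discontinuous and can only be approximated by a ReLU network), and the helper names `triangle_network`, `network`, and `triangle_wave` are hypothetical. Every change of slope needs its own hidden unit, so the width grows linearly with the number of periods even though the function has a very short description.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def triangle_network(periods):
    """Return (w, b, v) for a one-hidden-layer ReLU net equal to a triangle wave on [0, periods]."""
    j = np.arange(2 * periods)            # one hidden unit per half-period
    w = np.ones(2 * periods)              # all input weights set to 1
    b = -j / 2.0                          # the kink of unit j sits at x = j / 2
    v = np.where(j % 2 == 1, -4.0, 4.0)   # slope flips between +2 and -2 at each kink
    v[0] = 2.0                            # the first unit only starts the initial +2 slope
    return w, b, v

def network(x, w, b, v):
    """Evaluate sum_i v_i * relu(w_i * x + b_i) at each point of x."""
    return relu(np.outer(x, w) + b) @ v

def triangle_wave(x):
    """Reference triangle wave: period 1, zeros at the integers, peaks of height 1."""
    return 1.0 - np.abs(2.0 * (x % 1.0) - 1.0)

periods = 5
w, b, v = triangle_network(periods)
x = np.linspace(0.0, periods, 1001)
max_error = np.max(np.abs(network(x, w, b, v) - triangle_wave(x)))
print("hidden units used:", len(w), "| max abs error:", max_error)
```

The construction needs `2 * periods` hidden units, and each kink must sit at a specific location, an arrangement that randomly sampled parameters are unlikely to produce.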
In summary, this study reveals the potential biases of neural networks when dealing with specific types of functions and provides a theoretical basis for further understanding the learning capabilities of neural networks.