Wide Deep Neural Networks with Gaussian Weights are Very Close to Gaussian Processes

Dario Trevisan
2023-12-19
Abstract: We establish novel rates for the Gaussian approximation of random deep neural networks with Gaussian parameters (weights and biases) and Lipschitz activation functions, in the wide limit. Our bounds apply for the joint output of a network evaluated on any finite input set, provided a certain non-degeneracy condition of the infinite-width covariances holds. We demonstrate that the distance between the network output and the corresponding Gaussian approximation scales inversely with the width of the network, exhibiting faster convergence than the naive heuristic suggested by the central limit theorem. We also apply our bounds to obtain theoretical approximations for the exact Bayesian posterior distribution of the network, when the likelihood is a bounded Lipschitz function of the network output evaluated on a (finite) training set. This includes popular cases such as the Gaussian likelihood, i.e. exponential of minus the mean squared error.
Machine Learning, Statistics Theory, Probability
What problem does this paper attempt to address?
The paper studies how closely the output of a random deep neural network (DNN) with Gaussian weights and biases matches a Gaussian process (GP) as the layer widths grow. Specifically, the author establishes new convergence rates that bound the distance between the network output and the corresponding Gaussian approximation in terms of the layer widths.

### Main Contributions

1. **New Convergence Rates**:
   - The author bounds the Wasserstein distance \( W_p \) between the multi-dimensional output \( f^{(\ell)}[X] \) of a random deep neural network evaluated on a finite input set \( X \) and its Gaussian approximation \( G^{(\ell)}_{n_\ell}[X] \). The bound decays faster in the hidden-layer widths \( n_i \) than the rate suggested by the Central Limit Theorem (CLT).
   - The specific convergence rates are as follows:
     - If all covariance matrices \( K^{(\ell)}[X] \) are invertible, then:
       \[ W_p(f^{(\ell)}[X], G^{(\ell)}_{n_\ell}[X]) \leq c\sqrt{n_\ell}\sum_{i = 1}^{\ell - 1}\frac{1}{n_i} \]
     - In general:
       \[ W_p(f^{(\ell)}[X], G^{(\ell)}_{n_\ell}[X]) \leq c\sqrt{n_\ell}\sum_{i = 1}^{\ell - 1}\frac{1}{\sqrt{n_i}} \]
2. **Approximation of Bayesian Posterior Distribution**:
   - Using the above convergence rates, the author bounds the distance between the exact Bayesian posterior distribution of the random deep neural network and the posterior of the corresponding Gaussian process, when the likelihood is a bounded, Lipschitz-continuous function of the network output on a finite training set.
   - In particular, for the Gaussian likelihood function:
     \[ W_1(f^{(L)}[X]|D, G^{(L)}_{n_L}[X]|D) \leq c\sum_{\ell = 1}^{L - 1}\frac{1}{n_\ell} \]
     or, in general:
     \[ W_1(f^{(L)}[X]|D, G^{(L)}_{n_L}[X]|D) \leq c\sum_{\ell = 1}^{L - 1}\frac{1}{\sqrt{n_\ell}} \]

### Methods and Techniques

- **Proof Techniques**:
  - The proof builds on the basic idea of Neal (1996): the Gaussian limit at each layer arises from the Central Limit Theorem, thanks to the scaling of the weight parameters, combined with the convergence inherited from the previous layer.
  - By introducing the "empirical kernel" and combining Gaussian approximation with the law of large numbers, the author proves that the empirical kernel converges as the network width increases.
  - Working backwards from the convergence rate of the empirical kernel, the author derives a stronger Gaussian approximation rate for the deep neural network.

### Significance and Impact

- **Theoretical Significance**:
  - This research provides a new theoretical basis for understanding the behavior of deep neural networks in the infinite-width limit, especially their relationship with Gaussian processes.
  - These results may help to explain why over-parameterized neural networks are so successful in practice.
- **Practical Applications**:
  - By quantifying the approximation error of the Bayesian posterior distribution, this research provides a theoretical basis for using Bayesian methods in deep learning.
  - These results can inform the training and inference of deep learning models, especially with respect to uncertainty quantification and generalization.

### Open Problems

- **Parameter Dependence**:
  - Future research can further explore how the constant \( c \) depends on the network depth, the number of neurons in each layer, the activation function, and the dimension of the input space.
- **Derivatives**
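
The empirical-kernel step mentioned under Methods and Techniques can be illustrated numerically. The sketch below is not the paper's proof and does not compute any Wasserstein distance. It only draws random ReLU networks and measures how far the empirical kernel after the last hidden layer is from its infinite-width limit \( K^{(\ell)}[X] \). Several choices are assumptions made for illustration: the ReLU activation, weights with variance 1/(fan-in), biases with variance 1, and the hypothetical helper names `relu_expectation`, `limit_kernel`, `empirical_kernel`, and `mean_kernel_error`.

```python
import numpy as np


def relu_expectation(K):
    """Entrywise E[relu(u) * relu(v)] for (u, v) ~ N(0, K), via the arc-cosine formula."""
    std = np.sqrt(np.diag(K))
    scale = np.outer(std, std)
    cos_theta = np.clip(K / scale, -1.0, 1.0)  # clip guards against rounding error
    theta = np.arccos(cos_theta)
    return scale * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2.0 * np.pi)


def limit_kernel(X, depth):
    """Infinite-width covariance of layer `depth` on the inputs X (assumed scaling)."""
    K = 1.0 + X @ X.T / X.shape[1]  # first layer: bias variance 1, weight variance 1/n_0
    for _ in range(depth - 1):
        K = 1.0 + relu_expectation(K)
    return K


def empirical_kernel(X, hidden_widths, rng):
    """Empirical kernel of one random network after its last hidden layer."""
    h = X.T  # shape (n_0, m): one column per input point
    for n in hidden_widths:
        n_prev = h.shape[0]
        W = rng.normal(0.0, 1.0 / np.sqrt(n_prev), size=(n, n_prev))
        b = rng.normal(0.0, 1.0, size=(n, 1))
        h = np.maximum(W @ h + b, 0.0)  # ReLU activations, shape (n, m)
    return 1.0 + h.T @ h / h.shape[0]


def mean_kernel_error(X, width, n_hidden, trials, rng):
    """Average Frobenius distance between empirical and limiting kernels."""
    K = limit_kernel(X, depth=n_hidden + 1)
    errors = [
        np.linalg.norm(empirical_kernel(X, [width] * n_hidden, rng) - K)
        for _ in range(trials)
    ]
    return float(np.mean(errors))


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # 4 input points in dimension n_0 = 3
for width in (16, 64, 256, 1024):
    err = mean_kernel_error(X, width, n_hidden=2, trials=10, rng=rng)
    print(f"width={width:5d}  mean kernel error={err:.4f}")
```

Conditionally on the hidden layers, the next layer of such a network is exactly Gaussian with covariance given by the empirical kernel, so the printed error is a rough proxy for how far a finite-width network is from its Gaussian limit. The error should shrink as the width grows. How this kernel fluctuation translates into the Wasserstein rates quoted above is the content of the paper, which this snippet does not attempt to reproduce.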