Abstract: We establish novel rates for the Gaussian approximation of random deep neural networks with Gaussian parameters (weights and biases) and Lipschitz activation functions, in the wide limit. Our bounds apply to the joint output of a network evaluated on any finite input set, provided a certain non-degeneracy condition of the infinite-width covariances holds. We demonstrate that the distance between the network output and the corresponding Gaussian approximation scales inversely with the width of the network, exhibiting faster convergence than the naive heuristic suggested by the central limit theorem. We also apply our bounds to obtain theoretical approximations for the exact Bayesian posterior distribution of the network, when the likelihood is a bounded Lipschitz function of the network output evaluated on a (finite) training set. This includes popular cases such as the Gaussian likelihood, i.e. exponential of minus the mean squared error.
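The setting in the abstract can be explored numerically. The following is a minimal sketch, not the paper's construction: it assumes a ReLU activation (one Lipschitz choice), weights with variance `sigma_w**2 / n_in`, biases with variance `sigma_b**2`, and a scalar output. The function names `relu_expectation`, `limit_kernel`, and `sample_outputs` are hypothetical helpers introduced for illustration. The sketch compares the empirical second moments of the network output on a finite input set with the infinite-width covariance obtained from the standard layer-by-layer kernel recursion.

```python
import numpy as np


def relu_expectation(k11, k12, k22):
    """E[relu(u) * relu(v)] for a centered Gaussian pair (u, v) with
    Var(u) = k11, Var(v) = k22, Cov(u, v) = k12 (closed form for ReLU)."""
    norm = np.sqrt(k11 * k22)
    cos_theta = np.clip(k12 / norm, -1.0, 1.0)  # guard against rounding
    theta = np.arccos(cos_theta)
    return norm * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2.0 * np.pi)


def limit_kernel(X, depth, sigma_w=np.sqrt(2.0), sigma_b=0.1):
    """Infinite-width covariance on the rows of X after `depth` affine layers.
    The variance scaling is an assumption of this sketch."""
    d = X.shape[1]
    K = sigma_b**2 + sigma_w**2 * (X @ X.T) / d
    for _ in range(depth - 1):
        diag = np.diag(K)
        K = sigma_b**2 + sigma_w**2 * relu_expectation(
            diag[:, None], K, diag[None, :]
        )
    return K


def sample_outputs(X, widths, n_samples, rng, sigma_w=np.sqrt(2.0), sigma_b=0.1):
    """Draw the scalar output on every row of X for `n_samples` independent
    networks with Gaussian weights and biases and hidden widths `widths`."""
    sizes = [X.shape[1], *widths, 1]
    out = np.empty((n_samples, X.shape[0]))
    for s in range(n_samples):
        h = X.T  # shape (features, number of inputs)
        for layer, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
            if layer > 0:
                h = np.maximum(h, 0.0)  # ReLU before every layer but the first
            W = rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))
            b = rng.normal(0.0, sigma_b, size=(n_out, 1))
            h = W @ h + b
        out[s] = h[0]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.array([[1.0, 0.0, 0.0], [0.6, 0.8, 0.0]])
    samples = sample_outputs(X, widths=[64, 64], n_samples=4000, rng=rng)
    empirical = samples.T @ samples / samples.shape[0]  # outputs are centered
    print("empirical second moments:\n", empirical)
    print("infinite-width kernel:\n", limit_kernel(X, depth=3))
```

Matching covariances is only a necessary check: the bounds in the abstract concern the Wasserstein distance between the full joint laws, which this sketch does not estimate.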

What problem does this paper attempt to address?

This paper studies the relationship, in the infinite-width limit, between the output of random deep neural networks (DNNs) with Gaussian weights and Gaussian processes (GPs). Specifically, the author establishes new convergence rates that describe how the distance between the network output and the corresponding Gaussian approximation changes as the width of the neural network increases.

### Main Contributions

1. **New Convergence Rates**:
   - The author proves that the distance between the multi-dimensional output of a random deep neural network, evaluated on a finite input set, and the corresponding Gaussian process converges as the network width increases, at a rate faster than the one suggested by the Central Limit Theorem (CLT).
   - The specific convergence rates are as follows:
     - If all covariance matrices \( K^{(\ell)}[X] \) are invertible, then:
       \[ W_p(f^{(\ell)}[X], G^{(\ell)}_{n_\ell}[X]) \leq c n_\ell \gamma_k n_\ell^{p} \sum_{i = 1}^{\ell - 1}\frac{1}{n_i} \]
     - In general:
       \[ W_p(f^{(\ell)}[X], G^{(\ell)}_{n_\ell}[X]) \leq c\sqrt{n_\ell}\sum_{i = 1}^{\ell - 1}\frac{1}{\sqrt{n_i}} \]
2. **Approximation of Bayesian Posterior Distribution**:
   - Using the above convergence rates, the author bounds the distance between the exact Bayesian posterior distribution of the random deep neural network and the posterior of the corresponding Gaussian process, when the likelihood is a bounded, Lipschitz-continuous function.
   - In particular, for the Gaussian likelihood function:
     \[ W_1(f^{(L)}[X]|D, G^{(L)}_{n_L}[X]|D) \leq c\sum_{\ell = 1}^{L - 1}\frac{1}{n_\ell} \]
     Or in general:
     \[ W_1(f^{(L)}[X]|D, G^{(L)}_{n_L}[X]|D) \leq c\sum_{\ell = 1}^{L - 1}\frac{1}{\sqrt{n_\ell}} \]

### Methods and Techniques

- **Proof Techniques**:
  - The author builds on the basic idea of Neal (1996): the Gaussian limit of each layer follows from the Central Limit Theorem, through the scaling of the weight parameters, combined with the convergence inherited from the previous layer.
  - By introducing the concept of an "empirical kernel" and using Gaussian approximation together with the law of large numbers, the author proves that the empirical kernel converges as the network width increases.
  - Working backward from the convergence rate of the empirical kernel, the author derives a stronger Gaussian approximation rate for the deep neural network.

### Significance and Impact

- **Theoretical Significance**:
  - This research provides a new theoretical basis for understanding the behavior of deep neural networks in the infinite-width limit, especially their relationship with Gaussian processes.
  - These results help to explain why over-parameterized neural networks are so successful in practice.
- **Practical Applications**:
  - By quantifying the approximation error of the Bayesian posterior distribution, this research provides a theoretical basis for using Bayesian methods in deep learning.
  - These results can be applied to improve the training and inference of deep learning models, especially with respect to uncertainty and generalization.

### Open Problems

- **Parameter Dependence**:
  - Future research can further explore the dependence of the constant \( c \) on the network depth, the number of neurons in each layer, the activation function, and the dimension of the input space.
- **Derivatives**
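
To make the width dependence of the two posterior bounds concrete, the short sketch below tabulates the factors \( \sum_{\ell} 1/n_\ell \) and \( \sum_{\ell} 1/\sqrt{n_\ell} \) for a few hidden-layer widths. The constant \( c \) is not specified in this summary, so it is left out and only the width-dependent factors are compared. The helper name `rate_factors` and the example widths are assumptions chosen for illustration.

```python
def rate_factors(hidden_widths):
    """Return (sum of 1/n, sum of 1/sqrt(n)) over the hidden-layer widths.
    These are the width-dependent factors of the two bounds, without c."""
    fast = sum(1.0 / n for n in hidden_widths)
    slow = sum(1.0 / n ** 0.5 for n in hidden_widths)
    return fast, slow


if __name__ == "__main__":
    # Three hidden layers of equal width n (an illustrative choice).
    for n in (64, 256, 1024):
        fast, slow = rate_factors([n, n, n])
        print(f"n={n:5d}  sum 1/n = {fast:.5f}   sum 1/sqrt(n) = {slow:.5f}")
```

With equal widths, quadrupling the width divides the first factor by four and the second by only two, which is the gap between the two rates.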

Wide Deep Neural Networks with Gaussian Weights are Very Close to Gaussian Processes
