Abstract: This paper discusses the impact of the random initialization of neural networks in neural tangent kernel (NTK) theory, an effect that most recent works on NTK theory ignore. It is well known that, as the network's width tends to infinity, a neural network with random initialization converges to a Gaussian process $f^{\mathrm{GP}}$ taking values in $L^{2}(\mathcal{X})$, where $\mathcal{X}$ is the domain of the data. In contrast, to apply the traditional theory of kernel regression, most recent works introduce a special mirrored architecture and a mirrored (random) initialization that ensure the network's output is identically zero at initialization. It therefore remains an open question whether wide neural networks exhibit different generalization capabilities under the conventional setting and under mirrored initialization. In this paper, we first show that the training dynamics of the gradient flow of neural networks with random initialization converge uniformly to those of the corresponding NTK regression with random initialization $f^{\mathrm{GP}}$. We then show that $\mathbf{P}(f^{\mathrm{GP}} \in [\mathcal{H}^{\mathrm{NT}}]^{s}) = 1$ for any $s < \frac{3}{d+1}$ and $\mathbf{P}(f^{\mathrm{GP}} \in [\mathcal{H}^{\mathrm{NT}}]^{s}) = 0$ for any $s \geq \frac{3}{d+1}$, where $[\mathcal{H}^{\mathrm{NT}}]^{s}$ is the real interpolation space of the RKHS $\mathcal{H}^{\mathrm{NT}}$ associated with the NTK. Consequently, the generalization error of the wide neural network trained by gradient descent is $\Omega(n^{-\frac{3}{d+3}})$, so it still suffers from the curse of dimensionality. On the one hand, this result highlights the benefits of mirrored initialization. On the other hand, it implies that NTK theory may not fully explain the superior performance of neural networks.
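The contrast between the two initializations can be illustrated with a minimal sketch. The abstract does not spell out the mirrored construction, so the form used here is an assumption: the output is the scaled difference of two copies of the same network, $f(x) = \frac{1}{\sqrt{2}}\left[g(x;\theta^{(1)}) - g(x;\theta^{(2)})\right]$, with $\theta^{(1)} = \theta^{(2)}$ at initialization. The two-layer ReLU network, the $1/\sqrt{m}$ scaling, and the names `init_params`, `two_layer`, and `mirrored` are likewise hypothetical choices made for illustration.

```python
import numpy as np

def init_params(rng, d, m):
    """Standard Gaussian initialization of a two-layer network (assumed setup)."""
    return {"W": rng.standard_normal((m, d)), "a": rng.standard_normal(m)}

def two_layer(params, X):
    """g(x) = a^T relu(W x) / sqrt(m), with an NTK-style 1/sqrt(m) scaling."""
    m = params["a"].shape[0]
    hidden = np.maximum(X @ params["W"].T, 0.0)  # shape (n, m)
    return hidden @ params["a"] / np.sqrt(m)

def mirrored(params_pos, params_neg, X):
    """Assumed mirrored architecture: scaled difference of two sub-networks."""
    return (two_layer(params_pos, X) - two_layer(params_neg, X)) / np.sqrt(2.0)

rng = np.random.default_rng(0)
d, m, n = 3, 512, 5                # input dimension, width, number of inputs
X = rng.standard_normal((n, d))

theta = init_params(rng, d, m)
# Mirrored initialization: the second sub-network starts as an exact copy.
theta_copy = {k: v.copy() for k, v in theta.items()}

f_standard = two_layer(theta, X)             # random function at initialization
f_mirrored = mirrored(theta, theta_copy, X)  # identically zero at initialization

print("standard init output:", f_standard)
print("mirrored init output:", f_mirrored)
```

Under these assumptions the mirrored output vanishes at every input, while the standard network starts from a nonzero random function. In the infinite-width limit that random function is the Gaussian process $f^{\mathrm{GP}}$ whose smoothness the abstract characterizes.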

Spurious Local Minima of Deep ReLU Neural Networks in the Neural Tangent Kernel Regime

Analyzing Finite Neural Networks: Can We Trust Neural Tangent Kernel Theory?

Dynamics of Deep Neural Networks and Neural Tangent Hierarchy

Neural Tangent Kernel Beyond the Infinite-Width Limit: Effects of Depth and Initialization

On the Omnipresence of Spurious Local Minima in Certain Neural Network Training Problems

Spectral Analysis of the Neural Tangent Kernel for Deep Residual Networks

Neural Networks with Sparse Activation Induced by Large Bias: Tighter Analysis with Bias-Generalized NTK

Towards an Understanding of Residual Networks Using Neural Tangent Hierarchy (NTH)

Spurious Local Minima Are Common for Deep Neural Networks with Piecewise Linear Activations

On Random Kernels of Residual Architectures

Exact Convergence Rates of the Neural Tangent Kernel in the Large Depth Limit

On the Generalization Power of Overfitted Two-Layer Neural Tangent Kernel Models

Depth Creates No Bad Local Minima

Tensor Programs II: Neural Tangent Kernel for Any Architecture

On the Impacts of the Random Initialization in the Neural Tangent Kernel Theory

Fixing the NTK: From Neural Network Linearizations to Exact Convex Programs

A Revision of Neural Tangent Kernel-based Approaches for Neural Networks

Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks

Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes

Elimination of All Bad Local Minima in Deep Learning