Abstract: We consider nonparametric regression by an over-parameterized two-layer neural network trained by gradient descent (GD) or one of its variants. We show that, if the neural network is trained with a novel Preconditioned Gradient Descent (PGD) with early stopping and the target function has the spectral bias widely studied in the deep learning literature, the trained network attains a particularly sharp generalization bound with a minimax optimal rate of $\mathcal{O}(1/n^{4\alpha/(4\alpha+1)})$, which is sharper than the current standard rate of $\mathcal{O}(1/n^{2\alpha/(2\alpha+1)})$ with $2\alpha = d/(d-1)$ when the data is distributed uniformly on the unit sphere in $\mathbb{R}^d$ and $n$ is the size of the training data. When the target function has no spectral bias, we prove that a neural network trained with regular GD with early stopping still enjoys the minimax optimal rate, and in this case our results do not require distributional assumptions, in contrast with currently known results. Our results are built upon two significant technical contributions. First, uniform convergence to the NTK is established during training by PGD or GD, so that the neural network function at any step of GD or PGD decomposes into a function in the RKHS and an error function with a small $L^{\infty}$-norm. Second, local Rademacher complexity is employed to tightly bound the Rademacher complexity of the function class comprising all possible neural network functions obtained by GD or PGD. Our results also indicate that PGD can be another way of avoiding the usual linear regime of the NTK and obtaining a sharper generalization bound, because PGD induces a different kernel, with lower kernel complexity during training, than the regular NTK induced by the network architecture trained by regular GD.
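
To make the comparison of the two rates concrete, the exponents can be rewritten in terms of the dimension $d$ alone. The following is only a worked substitution, assuming that the same $\alpha$ with $2\alpha = d/(d-1)$ enters both rates and that $d \ge 2$; it is not a statement taken from the paper itself.

$$
\begin{aligned}
\frac{4\alpha}{4\alpha+1} &= \frac{2d/(d-1)}{2d/(d-1)+1} = \frac{2d}{3d-1}, \\
\frac{2\alpha}{2\alpha+1} &= \frac{d/(d-1)}{d/(d-1)+1} = \frac{d}{2d-1}, \\
\frac{2d}{3d-1} - \frac{d}{2d-1} &= \frac{d(d-1)}{(3d-1)(2d-1)} > 0 \quad \text{for } d \ge 2 .
\end{aligned}
$$

For example, under these assumptions $d = 3$ gives $2\alpha = 3/2$, so the exponent is $3/4$ for the PGD rate and $3/5$ for the standard rate.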

Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression: A Distribution-Free Analysis

Preconditioned Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression

Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network

Analysis of the Gradient Descent Algorithm for a Deep Neural Network Model with Skip-connections

Stability & Generalisation of Gradient Descent for Shallow Neural Networks without the Neural Tangent Kernel

Gradient Descent Monotonically Decreases the Sharpness of Gradient Flow Solutions in Scalar Networks and Beyond

Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods

How many Neurons do we need? A refined Analysis for Shallow Networks trained with Gradient Descent

Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning

Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks

On the Unstable Convergence Regime of Gradient Descent

Convergence Analysis of Natural Gradient Descent for Over-parameterized Physics-Informed Neural Networks

Stochastic Gradient Descent for Two-layer Neural Networks

Stochastic Gradient Descent Introduces an Effective Landscape-Dependent Regularization Favoring Flat Solutions

Analysis of the expected $L_2$ error of an over-parametrized deep neural network estimate learned by gradient descent without regularization

A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions

GD doesn't make the cut: Three ways that non-differentiability affects neural network training

Gradient Descent can Learn Less Over-parameterized Two-layer Neural Networks on Classification Problems

An Improved Analysis of Training Over-parameterized Deep Neural Networks