Abstract: We consider nonparametric regression by an over-parameterized two-layer neural network trained by gradient descent (GD) or one of its variants. We show that, if the neural network is trained with a novel Preconditioned Gradient Descent (PGD) with early stopping and the target function has the spectral bias widely studied in the deep learning literature, the trained network attains a particularly sharp generalization bound with a minimax optimal rate of $\cO({1}/{n^{4\alpha/(4\alpha+1)}})$, which is sharper than the current standard rate of $\cO({1}/{n^{2\alpha/(2\alpha+1)}})$ with $2\alpha = d/(d-1)$ when the data is distributed uniformly on the unit sphere in $\RR^d$ and $n$ is the size of the training data. When the target function has no spectral bias, we prove that a neural network trained with regular GD with early stopping still enjoys the minimax optimal rate, and in this case our results do not require distributional assumptions, in contrast with currently known results. Our results are built upon two significant technical contributions. First, uniform convergence to the NTK is established during training by PGD or GD, so that the neural network function at any step of GD or PGD can be decomposed into a function in the RKHS and an error function with a small $L^{\infty}$-norm. Second, local Rademacher complexity is employed to tightly bound the Rademacher complexity of the function class comprising all possible neural network functions obtained by GD or PGD. Our results also indicate that PGD can be another way of avoiding the usual linear regime of NTK and obtaining a sharper generalization bound, because PGD induces a different kernel, with lower kernel complexity during training, than the regular NTK induced by the network architecture trained by regular GD.
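
To make the rate comparison concrete, the two exponents can be worked out in terms of $d$. This is only a sketch of the arithmetic, assuming the same $\alpha$ with $2\alpha = d/(d-1)$ is substituted into both rates and $d \ge 2$; the abstract itself does not spell out this substitution.

$$\frac{2\alpha}{2\alpha+1} = \frac{d/(d-1)}{d/(d-1)+1} = \frac{d}{2d-1}, \qquad \frac{4\alpha}{4\alpha+1} = \frac{2d/(d-1)}{2d/(d-1)+1} = \frac{2d}{3d-1}.$$

Under that assumption, for $d = 3$ the standard rate is $n^{-3/5}$ while the sharper rate is $n^{-3/4}$, and as $d \to \infty$ the exponents tend to $1/2$ and $2/3$ respectively, so the gap persists in high dimension.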

Gradient Descent can Learn Less Over-parameterized Two-layer Neural Networks on Classification Problems

Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods

On the convergence of gradient descent for two layer neural networks

Gradient Descent Maximizes the Margin of Homogeneous Neural Networks

A Comparative Analysis of Optimization and Generalization Properties of Two-Layer Neural Network and Random Feature Models under Gradient Descent Dynamics

A Comparative Analysis of the Optimization and Generalization Property of Two-layer Neural Network and Random Feature Models Under Gradient Descent Dynamics

Fast Convergence in Learning Two-Layer Neural Networks with Separable Data

On Learnability via Gradient Method for Two-Layer ReLU Neural Networks in Teacher-Student Setting

Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data

Stochastic Gradient Descent for Two-layer Neural Networks

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Gradient Descent Provably Escapes Saddle Points in the Training of Shallow ReLU Networks

Preconditioned Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression

Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization

Feature selection with gradient descent on two-layer networks in low-rotation regimes

How Does Gradient Descent Learn Features -- A Local Analysis for Regularized Two-Layer Neural Networks

A Geometric Approach of Gradient Descent Algorithms in Linear Neural Networks

On the Unstable Convergence Regime of Gradient Descent

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks