Abstract: We consider nonparametric regression by an over-parameterized two-layer neural network trained by gradient descent (GD) or one of its variants. We show that, if the neural network is trained with a novel Preconditioned Gradient Descent (PGD) with early stopping and the target function has spectral bias, a property widely studied in the deep learning literature, the trained network achieves a particularly sharp generalization bound with a minimax optimal rate of $\cO({1}/{n^{4\alpha/(4\alpha+1)}})$, which is sharper than the current standard rate of $\cO({1}/{n^{2\alpha/(2\alpha+1)}})$ with $2\alpha = d/(d-1)$ when the data is distributed uniformly on the unit sphere in $\RR^d$ and $n$ is the size of the training data. When the target function has no spectral bias, we prove that a neural network trained with regular GD with early stopping still enjoys a minimax optimal rate, and in this case our results do not require distributional assumptions, in contrast with the currently known results. Our results are built upon two significant technical contributions. First, uniform convergence to the NTK is established during training by PGD or GD, so that the neural network function at any step of GD or PGD can be decomposed into a function in the RKHS and an error function with a small $L^{\infty}$-norm. Second, local Rademacher complexity is employed to tightly bound the Rademacher complexity of the function class comprising all possible neural network functions obtained by GD or PGD. Our results also indicate that PGD can be another way of avoiding the usual linear regime of NTK and obtaining a sharper generalization bound, because PGD induces a different kernel with lower kernel complexity during training than the regular NTK induced by the network architecture trained by regular GD.
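
To make the gap between the two rates in the abstract concrete, the following minimal sketch evaluates both exponents for a given dimension $d$. It assumes that the same $\alpha$, fixed by $2\alpha = d/(d-1)$, is substituted into both expressions; the function name `rate_exponents` is hypothetical and used only for illustration.

```python
from fractions import Fraction


def rate_exponents(d: int) -> tuple[Fraction, Fraction]:
    """Return (standard, sharp) exponents e, where each rate is O(1/n^e).

    Assumption: 2*alpha = d/(d-1) is plugged into both rate expressions
    quoted in the abstract.
    """
    if d < 2:
        raise ValueError("d must be at least 2 so that d/(d-1) is defined")
    two_alpha = Fraction(d, d - 1)
    standard = two_alpha / (two_alpha + 1)       # 2*alpha / (2*alpha + 1)
    sharp = 2 * two_alpha / (2 * two_alpha + 1)  # 4*alpha / (4*alpha + 1)
    return standard, sharp


if __name__ == "__main__":
    for d in (2, 3, 10):
        standard, sharp = rate_exponents(d)
        print(f"d={d}: standard exponent {standard}, sharp exponent {sharp}")
```

Under this assumption the exponents simplify to $d/(2d-1)$ and $2d/(3d-1)$; for $d = 3$ they are $3/5$ and $3/4$, so the sharper bound decays noticeably faster in $n$.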

Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning

Adaptive Gradient Regularization: A Faster and Generalizable Optimization Technique for Deep Neural Networks

Mitigating Gradient Overlap in Deep Residual Networks with Gradient Normalization for Improved Non-Convex Optimization

When Will Gradient Regularization Be Harmful?

Stochastic Gradient Descent Introduces an Effective Landscape-Dependent Regularization Favoring Flat Solutions

Preconditioned Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression

ZNorm: Z-Score Gradient Normalization Accelerating Skip-Connected Network Training without Architectural Modification

An Adaptive Gradient Regularization Method

Neighborhood Region Smoothing Regularization for Finding Flat Minima in Deep Neural Networks

Improving Generalization of Deep Neural Networks by Optimum Shifting

Understanding Why Neural Networks Generalize Well Through GSNR of Parameters

Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization

Loss Gradient Gaussian Width based Generalization and Optimization Guarantees

Optimization and Generalization Guarantees for Weight Normalization

Regularized Gauss-Newton for Optimizing Overparameterized Neural Networks

Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods

Understanding the Generalization Benefits of Late Learning Rate Decay

Gradient Correction Beyond Gradient Descent

Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression: A Distribution-Free Analysis

Learning Gradient Descent: Better Generalization and Longer Horizons