Abstract: We consider nonparametric regression by an over-parameterized two-layer neural network trained by gradient descent (GD) or its variant in this paper. We show that, if the neural network is trained with a novel Preconditioned Gradient Descent (PGD) with early stopping and the target function has spectral bias widely studied in the deep learning literature, the trained network renders a particularly sharp generalization bound with a minimax optimal rate of $\mathcal{O}({1}/{n^{4\alpha/(4\alpha+1)}})$, which is sharper than the current standard rate of $\mathcal{O}({1}/{n^{2\alpha/(2\alpha+1)}})$ with $2\alpha = d/(d-1)$ when the data is distributed uniformly on the unit sphere in $\mathbb{R}^d$ and $n$ is the size of the training data. When the target function has no spectral bias, we prove that the neural network trained with regular GD with early stopping still enjoys the minimax optimal rate, and in this case our results do not require distributional assumptions, in contrast with the currently known results. Our results are built upon two significant technical contributions. First, uniform convergence to the NTK is established during the training process by PGD or GD, so that we can have a nice decomposition of the neural network function at any step of GD or PGD into a function in the RKHS and an error function with a small $L^{\infty}$-norm. Second, local Rademacher complexity is employed to tightly bound the Rademacher complexity of the function class comprising all the possible neural network functions obtained by GD or PGD. Our results also indicate that PGD can be another way of avoiding the usual linear regime of NTK and obtaining a sharper generalization bound, because PGD induces a different kernel with lower kernel complexity during the training than the regular NTK induced by the network architecture trained by regular GD.
What problem does this paper attempt to address?
This paper asks how an over-parameterized two-layer neural network trained by gradient descent can achieve a sharper generalization bound in nonparametric regression. Specifically, it studies the convergence rates attained by networks trained with Preconditioned Gradient Descent (PGD) or regular gradient descent (GD), both with early stopping, in two cases: when the target function has spectral bias and when it is a general target function.
### Main Problems
1. **Target function with spectral bias**:
- When the target function \( f^* \) has spectral bias (i.e., \( f^* \in H_K^{(\text{int})}(f_0) \subseteq H_K(f_0) \)) and the data is distributed uniformly on the unit sphere in \( \mathbb{R}^d \), the paper proves that the neural network trained with PGD and early stopping achieves a rate of \( O\left(\frac{1}{n^{\frac{4\alpha}{4\alpha + 1}}}\right) \), which is sharper than the current standard rate of \( O\left(\frac{1}{n^{\frac{2\alpha}{2\alpha + 1}}}\right) \), where \( 2\alpha = d/(d-1) \) and \( n \) is the size of the training data.
- This sharper rate is obtained by avoiding the usual linear regime of the NTK: PGD induces a different kernel during training, and this kernel has lower kernel complexity than the regular NTK induced by training with regular GD.
2. **General target function**:
- For a general target function \( f^* \in H_K(f_0) \), the paper proves that the neural network trained with regular GD and early stopping achieves the minimax optimal rate of \( O\left(\frac{1}{n^{\frac{2\alpha}{2\alpha + 1}}}\right) \) without distributional assumptions.
- This result therefore applies not only to the uniform distribution on the unit sphere but also to a wider range of data distributions. In addition, the paper quantifies the early-stopping time: the number of training steps satisfies \( T \leq \hat{T}_K \), where \( \hat{T}_K \) is a stopping time determined by the kernel \( K \).
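To make the gap between the two rates concrete, the following minimal sketch evaluates both exponents for the unit-sphere setting, using the relation \( 2\alpha = d/(d-1) \) from the abstract. The function name `rate_exponents` is a hypothetical helper introduced here for illustration and does not come from the paper.

```python
def rate_exponents(d):
    """Return (standard, sharper) exponents of 1/n for data on the unit sphere in R^d.

    Assumes 2*alpha = d/(d-1), as stated in the abstract.
    """
    if d < 2:
        raise ValueError("d must be at least 2 so that 2*alpha = d/(d-1) is defined")
    two_alpha = d / (d - 1)
    standard = two_alpha / (two_alpha + 1)          # exponent 2a/(2a+1)
    sharper = 2 * two_alpha / (2 * two_alpha + 1)   # exponent 4a/(4a+1)
    return standard, sharper


for d in (2, 3, 10):
    standard, sharper = rate_exponents(d)
    print(f"d={d}: standard rate n^-{standard:.3f}, sharper rate n^-{sharper:.3f}")
```

For example, with \( d = 3 \) we have \( 2\alpha = 3/2 \), so the standard exponent is \( 3/5 \) and the sharper exponent is \( 3/4 \).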
### Main Contributions
1. **Sharper generalization bound**:
- By introducing Preconditioned Gradient Descent (PGD), the paper obtains a rate sharper than the current standard rate when the target function has spectral bias.
- PGD avoids the usual linear regime of the NTK, which is what yields the sharper generalization bound.
2. **No data distribution assumption**:
- For a general target function, the paper proves that the neural network trained with regular GD and early stopping still achieves the minimax optimal rate, even without distributional assumptions.
- This extends existing theory, whose results rely on distributional assumptions, to a wider range of data distributions.
3. **Quantification of the early-stopping time**:
- The paper gives an explicit bound on the early-stopping time, which indicates how many training steps to run before stopping.
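The role of early stopping can be illustrated with plain kernel gradient descent on the fitted training values. This is a simplified sketch, not the paper's GD or PGD algorithm: the kernel `K` below is a hypothetical positive semidefinite stand-in for the NTK, and the data, step size, and step counts are assumptions chosen for illustration. The sketch checks the standard fact that after \( T \) steps the residual along each eigen-direction of \( K/n \) shrinks by \( (1 - \eta\lambda_i)^T \), so stopping early fits the leading eigen-directions first.

```python
import numpy as np


def kernel_gd(K, y, eta, T):
    """Run T steps of kernel gradient descent on the fitted values, starting from zero."""
    n = len(y)
    f = np.zeros(n)
    for _ in range(T):
        f = f - (eta / n) * K @ (f - y)
    return f


def residual_closed_form(K, y, eta, T):
    """Residual y - f_T computed from the eigendecomposition of K / n."""
    n = len(y)
    lam, U = np.linalg.eigh(K / n)
    return U @ ((1.0 - eta * lam) ** T * (U.T @ y))


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # points on the unit sphere
K = np.exp(X @ X.T - 1.0)                       # hypothetical PSD kernel, not the paper's NTK
y = X[:, 0] + 0.1 * rng.normal(size=50)         # noisy synthetic targets

for T in (5, 20, 200):
    f_T = kernel_gd(K, y, eta=1.0, T=T)
    print(T, np.linalg.norm(y - f_T))
```

Because the diagonal of `K` is one, the eigenvalues of `K / n` lie in \([0, 1]\), so a step size of `eta=1.0` keeps every factor \( 1 - \eta\lambda_i \) in \([0, 1]\) and the training residual is non-increasing in \( T \).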
### Technical Contributions
1. **Uniform convergence to NTK**:
- The paper establishes uniform convergence to the NTK during training by PGD or GD, so that the neural network function at any training step can be decomposed into a function in the RKHS and an error function with a small \( L^\infty \) norm.
2. **Local Rademacher complexity**:
- The paper uses local Rademacher complexity to tightly bound the Rademacher complexity of the function class comprising all possible neural network functions obtained by GD or PGD.
These technical contributions provide a new perspective for understanding the training dynamics of over-parameterized neural networks and a theoretical basis for achieving sharper generalization bounds.
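The idea of convergence to the NTK mentioned above can be illustrated numerically in a much simpler setting than the paper's. The sketch below assumes a two-layer ReLU network with Gaussian first-layer weights and compares its empirical first-layer kernel at initialization with the well-known closed form \( \langle x, x' \rangle (\pi - \arccos\langle x, x' \rangle) / (2\pi) \) for unit-norm inputs. It checks only a finite sample at initialization, not uniform convergence during training, and the paper's exact kernel and parameterization may differ; all function names are hypothetical.

```python
import numpy as np


def limit_ntk(X):
    """Closed-form first-layer ReLU kernel for unit-norm rows of X."""
    G = np.clip(X @ X.T, -1.0, 1.0)
    return G * (np.pi - np.arccos(G)) / (2 * np.pi)


def empirical_ntk(X, m, rng):
    """Empirical kernel of a width-m network with N(0, I) first-layer weights."""
    W = rng.normal(size=(m, X.shape[1]))
    act = (X @ W.T >= 0).astype(float)   # ReLU activation patterns, shape (n, m)
    return (X @ X.T) * (act @ act.T) / m


def ntk_gap(X, m, rng):
    """Largest entrywise gap between the empirical and closed-form kernels."""
    return np.max(np.abs(empirical_ntk(X, m, rng) - limit_ntk(X)))


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # points on the unit sphere

for m in (100, 1000, 20000):
    print(m, ntk_gap(X, m, rng))
```

The gap is random, but it should typically shrink as the width `m` grows, roughly at the rate \( 1/\sqrt{m} \) expected from averaging bounded terms.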