Abstract: Momentum methods, such as the heavy ball method (HB) and Nesterov's accelerated gradient method (NAG), have been widely used in training neural networks by incorporating the history of gradients into the current update. In practice, they often outperform (stochastic) gradient descent (GD) and converge faster. Despite this empirical success, the theoretical understanding of their accelerated convergence rates remains insufficient. Recently, some attempts have been made by analyzing the trajectories of gradient-based methods in the over-parameterized regime, where the number of parameters is significantly larger than the number of training instances. However, most existing theoretical work is concerned with GD, and the established convergence result for NAG is inferior to those for HB and GD, which fails to explain the practical success of NAG. In this paper, we take a step towards closing this gap by analyzing NAG in training a randomly initialized over-parameterized two-layer fully connected neural network with ReLU activation. Although the objective function is non-convex and non-smooth, we show that NAG converges to a global minimum at a non-asymptotic linear rate (1−Θ(1/√κ))^t, where κ>1 is the condition number of a Gram matrix and t is the number of iterations. Compared to the convergence rate (1−Θ(1/κ))^t of GD, our result provides theoretical guarantees for the acceleration of NAG in neural network training. Furthermore, our findings suggest that NAG and HB have similar convergence rates. Finally, extensive experiments on six benchmark datasets have been conducted to validate our theoretical results.
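
The gap between the two rates can be illustrated on a much simpler problem than the one the paper studies. The sketch below is not the paper's setting (it uses a diagonal strongly convex quadratic rather than a two-layer ReLU network), and the function names, the eigenvalue grid, and the tolerance are assumptions chosen for illustration. On such a quadratic with condition number κ, GD with step size 1/L contracts at roughly 1 − 1/κ per step, while NAG with the standard momentum (√κ − 1)/(√κ + 1) contracts at roughly 1 − 1/√κ per step, so NAG needs far fewer iterations to reach the same tolerance.

```python
import math


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def gd_iterations(eigs, tol=1e-6, max_iter=100_000):
    """Count GD steps on f(x) = 0.5 * sum(eig_i * x_i**2) until ||x|| <= tol."""
    L = max(eigs)                      # smoothness constant; step size is 1/L
    x = [1.0] * len(eigs)
    for t in range(1, max_iter + 1):
        # The gradient of the diagonal quadratic is eig_i * x_i per coordinate.
        x = [xi - (lam * xi) / L for xi, lam in zip(x, eigs)]
        if norm(x) <= tol:
            return t
    return max_iter


def nag_iterations(eigs, tol=1e-6, max_iter=100_000):
    """Count NAG steps (constant momentum, strongly convex variant)."""
    L, mu = max(eigs), min(eigs)
    sqrt_kappa = math.sqrt(L / mu)
    beta = (sqrt_kappa - 1.0) / (sqrt_kappa + 1.0)   # standard momentum choice
    x_prev = x = [1.0] * len(eigs)
    for t in range(1, max_iter + 1):
        # Look-ahead point, then a gradient step evaluated at that point.
        y = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]
        x_prev = x
        x = [yi - (lam * yi) / L for yi, lam in zip(y, eigs)]
        if norm(x) <= tol:
            return t
    return max_iter


# Assumed toy spectrum: eigenvalues spread between 1 and 100, so kappa = 100.
eigs = [1.0 + 11.0 * i for i in range(10)]
gd_steps = gd_iterations(eigs)
nag_steps = nag_iterations(eigs)
print(f"kappa={max(eigs) / min(eigs):.0f}  GD steps={gd_steps}  NAG steps={nag_steps}")
```

The iteration counts depend on the assumed spectrum and tolerance; the point of the sketch is only the qualitative ordering between the two methods on a well-understood convex problem.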

When Will Gradient Regularization Be Harmful?

Adaptive Gradient Regularization: A Faster and Generalizable Optimization Technique for Deep Neural Networks

An Adaptive Gradient Regularization Method

Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning

Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities

Preconditioning for Accelerated Gradient Descent Optimization and Regularization

The Implicit Regularization for Adaptive Optimization Algorithms on Homogeneous Neural Networks

Gradient-Coherent Strong Regularization for Deep Neural Networks

Stochastic Gradient Descent Introduces an Effective Landscape-Dependent Regularization Favoring Flat Solutions

On the Unstable Convergence Regime of Gradient Descent

On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them: A Gradient-Norm Perspective

Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias

GrOD: Deep Learning with Gradients Orthogonal Decomposition for Knowledge Transfer, Distillation, and Adversarial Training

Provable convergence of Nesterov’s accelerated gradient method for over-parameterized neural networks

Regularization and Reparameterization Avoid Vanishing Gradients in Sigmoid-Type Networks

The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks

The Regularization Effects of Anisotropic Noise in Stochastic Gradient Descent

Loss Gradient Gaussian Width based Generalization and Optimization Guarantees

Provable Acceleration of Nesterov's Accelerated Gradient Method over Heavy Ball Method in Training Over-Parameterized Neural Networks

Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network