Abstract: Because training deep neural networks is computationally expensive, speeding up convergence is of great importance. Nesterov’s accelerated gradient (NAG) is one of the most popular accelerated optimizers in the deep learning community and often converges faster than gradient descent (GD) in practice. However, theoretical investigations of NAG mainly focus on the convex setting. Since the optimization landscape of a neural network is non-convex, little is known about the convergence and acceleration of NAG. Recently, some works have made progress towards understanding the convergence of NAG in training over-parameterized neural networks, where the number of parameters exceeds the number of training instances. Nonetheless, previous studies are limited to two-layer neural networks, which is far from explaining the remarkable success of NAG in optimizing deep neural networks. In this paper, we investigate the convergence of NAG in training two architectures of deep linear networks: deep fully-connected linear neural networks and deep linear ResNets. In the over-parameterized regime, we first analyze the residual dynamics induced by the training trajectory of NAG for a deep fully-connected linear neural network under random Gaussian initialization. Our results show that NAG converges to the global minimum at a (1 − O(1/√κ))^t rate when the width is near-linear in the depth of the network, where t is the number of iterations and κ > 1 is a constant depending on the condition number of the feature matrix. Compared to the (1 − O(1/κ))^t rate of GD, NAG achieves an acceleration over GD. For deep linear ResNets, we use the same analytical approach and obtain a similar convergence result, while the width requirement is independent of the depth. To the best of our knowledge, these are the first theoretical guarantees for the convergence and acceleration of NAG in training deep neural networks.
Numerical results show the acceleration of NAG over GD in terms of iterations. In addition, we conduct experiments to evaluate the effect of depth on the convergence rate of NAG, which validate our derived conditions on the width. We hope our results shed light on the optimization behavior of NAG for modern deep neural networks.
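
The gap between a rate governed by 1/κ and one governed by 1/√κ can be illustrated with a minimal sketch on a convex least-squares problem. This is not the deep linear network setting analyzed in the paper: the problem, the step size 1/L, and the momentum (√κ − 1)/(√κ + 1) are textbook choices for strongly convex quadratics, and the helper `run` and all variable names are assumptions made for illustration.

```python
import numpy as np

def run(method, A, b, steps):
    """Minimize 0.5 * ||A w - b||^2 with GD or NAG; return the loss per step."""
    eigs = np.linalg.eigvalsh(A.T @ A)        # eigenvalues in ascending order
    mu, L = eigs[0], eigs[-1]
    kappa = L / mu                            # condition number of the Hessian
    eta = 1.0 / L                             # step size (assumed, textbook choice)
    # Momentum for NAG on a strongly convex quadratic; zero momentum gives plain GD.
    beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1) if method == "nag" else 0.0
    w = np.zeros(A.shape[1])
    w_prev = w.copy()
    losses = []
    for _ in range(steps):
        y = w + beta * (w - w_prev)           # look-ahead point (equals w for GD)
        grad = A.T @ (A @ y - b)              # gradient evaluated at the look-ahead point
        w_prev, w = w, y - eta * grad
        losses.append(0.5 * np.sum((A @ w - b) ** 2))
    return np.array(losses)

rng = np.random.default_rng(0)
# Build A with singular values in [1, 10], so kappa of A^T A is 100.
U, _ = np.linalg.qr(rng.standard_normal((50, 20)))
V, _ = np.linalg.qr(rng.standard_normal((20, 20)))
A = U @ np.diag(np.linspace(1.0, 10.0, 20)) @ V.T
b = A @ rng.standard_normal(20)               # realizable target, so the minimum loss is 0

gd = run("gd", A, b, 200)
nag = run("nag", A, b, 200)
print(f"final loss  GD: {gd[-1]:.3e}  NAG: {nag[-1]:.3e}")
```

With κ = 100, the GD contraction factor per step is close to 1 − 1/κ = 0.99, while the NAG factor is close to 1 − 1/√κ = 0.9, so after the same number of iterations the NAG loss should be many orders of magnitude smaller.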

A convergence analysis of Nesterov’s accelerated gradient method in training deep linear neural networks
