Abstract: Adaptive gradient methods, such as AdaGrad, are among the most successful optimization algorithms for neural network training. While these methods are known to achieve better dimensional dependence than stochastic gradient descent (SGD) under favorable geometry for stochastic convex optimization, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In fact, under standard assumptions of Lipschitz gradients and bounded noise variance, it is known that SGD is worst-case optimal (up to absolute constants) in terms of finding a near-stationary point with respect to the $\ell_2$-norm, making further improvements impossible. Motivated by this limitation, we introduce refined assumptions on the smoothness structure of the objective and the gradient noise variance, which better suit the coordinate-wise nature of adaptive gradient methods. Moreover, we adopt the $\ell_1$-norm of the gradient as the stationarity measure, as opposed to the standard $\ell_2$-norm, to align with the coordinate-wise analysis and obtain tighter convergence guarantees for AdaGrad. Under these new assumptions and the $\ell_1$-norm stationarity measure, we establish an upper bound on the convergence rate of AdaGrad and a corresponding lower bound for SGD. In particular, for certain configurations of problem parameters, we show that the iteration complexity of AdaGrad outperforms SGD by a factor of $d$. To the best of our knowledge, this is the first result to demonstrate a provable gain of adaptive gradient methods over SGD in a non-convex setting. We also present supporting lower bounds, including one specific to AdaGrad and one applicable to general deterministic first-order methods, showing that our upper bound for AdaGrad is tight and unimprovable up to a logarithmic factor under certain conditions.
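To make the coordinate-wise update and the $\ell_1$-norm stationarity measure concrete, the following is a minimal sketch and not the paper's algorithm or setting. It assumes the standard diagonal form of AdaGrad, a deterministic separable quadratic as a toy objective (the paper treats stochastic non-convex problems), and hypothetical names and parameter values (`curvatures`, `lr`, `eps`, `steps`) chosen only for illustration.

```python
import math


def l1_norm(g):
    """l1-norm of a gradient vector: the stationarity measure used in the abstract."""
    return sum(abs(v) for v in g)


def l2_norm(g):
    """Standard l2-norm of a gradient vector, shown for comparison."""
    return math.sqrt(sum(v * v for v in g))


def grad(x, curvatures):
    """Gradient of the toy objective f(x) = 0.5 * sum_i c_i * x_i**2.

    This separable quadratic is an assumption made for illustration only.
    """
    return [c * xi for c, xi in zip(curvatures, x)]


def adagrad(x0, curvatures, lr=0.5, steps=100, eps=1e-8):
    """Diagonal AdaGrad: each coordinate is scaled by its own gradient history."""
    x = list(x0)
    accum = [0.0] * len(x)  # running sum of squared gradients, per coordinate
    for _ in range(steps):
        g = grad(x, curvatures)
        for i, gi in enumerate(g):
            accum[i] += gi * gi
            x[i] -= lr * gi / (math.sqrt(accum[i]) + eps)
    return x


# Toy problem with very different curvature in each coordinate (assumed values).
curvatures = [1.0, 10.0, 100.0, 1000.0]
x0 = [1.0, 1.0, 1.0, 1.0]

g0 = grad(x0, curvatures)
x_final = adagrad(x0, curvatures)
g_final = grad(x_final, curvatures)

print("initial l1 / l2 gradient norm:", l1_norm(g0), l2_norm(g0))
print("final   l1 / l2 gradient norm:", l1_norm(g_final), l2_norm(g_final))
```

For any gradient $g \in \mathbb{R}^d$, $\|g\|_2 \le \|g\|_1 \le \sqrt{d}\,\|g\|_2$, so a bound on the $\ell_1$-norm is the stronger requirement. Because each coordinate is normalized by its own accumulated squared gradients, the update in this sketch is nearly insensitive to the per-coordinate curvature, which is the kind of coordinate-wise behavior the refined assumptions are designed to capture.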

Analysis of the expected $L_2$ error of an over-parametrized deep neural network estimate learned by gradient descent without regularization

On the universal consistency of an over-parametrized deep neural network estimate learned by gradient descent

Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network

Dropout Regularization Versus $\ell_2$-Penalization in the Linear Model

Preconditioned Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression

Global $\mathcal{L}^2$ minimization at uniform exponential rate via geometrically adapted gradient descent in Deep Learning

Gradient Descent Finds Over-Parameterized Neural Networks with Sharp Generalization for Nonparametric Regression: A Distribution-Free Analysis

Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model

A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks

Nonparametric regression using over-parameterized shallow ReLU neural networks

Convergence Analysis for Over-Parameterized Deep Learning

High-Dimensional Linear Regression via Implicit Regularization

Regularization-wise double descent: Why it occurs and how to eliminate it

Convergence Analysis of Natural Gradient Descent for Over-parameterized Physics-Informed Neural Networks

An Improved Analysis of Training Over-parameterized Deep Neural Networks

On the Convergence of Gradient Descent for Large Learning Rates

Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion, and Blind Deconvolution

On the Lipschitz Constant of Deep Networks and Double Descent

Deep linear networks for regression are implicitly regularized towards flat minima

Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions

Analysis of the Gradient Descent Algorithm for a Deep Neural Network Model with Skip-connections