Abstract: Adaptive gradient methods, such as AdaGrad, are among the most successful optimization algorithms for neural network training. While these methods are known to achieve better dimensional dependence than stochastic gradient descent (SGD) under favorable geometry for stochastic convex optimization, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In fact, under standard assumptions of Lipschitz gradients and bounded noise variance, it is known that SGD is worst-case optimal (up to absolute constants) in terms of finding a near-stationary point with respect to the $\ell_2$-norm, making further improvements impossible. Motivated by this limitation, we introduce refined assumptions on the smoothness structure of the objective and the gradient noise variance, which better suit the coordinate-wise nature of adaptive gradient methods. Moreover, we adopt the $\ell_1$-norm of the gradient as the stationarity measure, as opposed to the standard $\ell_2$-norm, to align with the coordinate-wise analysis and obtain tighter convergence guarantees for AdaGrad. Under these new assumptions and the $\ell_1$-norm stationarity measure, we establish an upper bound on the convergence rate of AdaGrad and a corresponding lower bound for SGD. In particular, for certain configurations of problem parameters, we show that the iteration complexity of AdaGrad outperforms SGD by a factor of $d$. To the best of our knowledge, this is the first result to demonstrate a provable gain of adaptive gradient methods over SGD in a non-convex setting. We also present supporting lower bounds, including one specific to AdaGrad and one applicable to general deterministic first-order methods, showing that our upper bound for AdaGrad is tight and unimprovable up to a logarithmic factor under certain conditions.
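To make the coordinate-wise update and the two stationarity measures concrete, the following is a minimal sketch and not the paper's algorithm statement or experimental setup. It assumes the standard diagonal AdaGrad update, run with exact (noise-free) gradients on a toy quadratic. The names `adagrad`, `quadratic_grad`, and `scales`, and the values of `lr`, `eps`, and `steps`, are all illustrative assumptions.

```python
import numpy as np


def adagrad(grad_fn, x0, lr=0.5, eps=1e-8, steps=200):
    """Diagonal (coordinate-wise) AdaGrad; returns the final iterate.

    Assumption: standard textbook form with a small eps for numerical
    stability. Step size and iteration count are illustrative only.
    """
    x = np.asarray(x0, dtype=float).copy()
    accum = np.zeros_like(x)  # per-coordinate running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(x)
        accum += g * g
        # Each coordinate gets its own effective step size.
        x -= lr * g / (np.sqrt(accum) + eps)
    return x


def quadratic_grad(x, scales):
    """Gradient of the toy objective f(x) = 0.5 * sum_i scales[i] * x[i]**2."""
    return scales * x


# Hypothetical, badly scaled problem: curvature differs by 10^4 across coordinates.
scales = np.array([100.0, 1.0, 0.01])
x0 = np.ones(3)

g_init = quadratic_grad(x0, scales)
x_final = adagrad(lambda x: quadratic_grad(x, scales), x0)
g_final = quadratic_grad(x_final, scales)

# The two stationarity measures contrasted in the abstract.
l1_norm = np.linalg.norm(g_final, 1)
l2_norm = np.linalg.norm(g_final, 2)
print("l1 gradient norm:", l1_norm)
print("l2 gradient norm:", l2_norm)
```

For any vector in $d$ dimensions the two measures satisfy $\|g\|_2 \le \|g\|_1 \le \sqrt{d}\,\|g\|_2$, so a guarantee stated in the $\ell_1$-norm is the stronger one. The sketch uses deterministic gradients for simplicity, whereas the results in the abstract concern the stochastic setting.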

Convergence Analysis of an Adaptively Regularized Natural Gradient Method

Convergence Analysis of Gradient Algorithms on Riemannian Manifolds Without Curvature Constraints and Application to Riemannian Mass

Convergence Analysis of Graph Regularized Non-Negative Matrix Factorization

Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions

On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization

Convergence analysis of the Gauss–Newton method for convex inclusion and convex-composite optimization problems

Convergence Behavior of Gauss-Newton's Method and Extensions of the Smale Point Estimate Theory.

Linear Convergence of Adaptive Stochastic Gradient Descent

Bound Analysis of Natural Gradient Descent in Stochastic Optimization Setting

On the Convergence of AdaGrad(Norm) on $\mathbb{R}^{d}$: Beyond Convexity, Non-Asymptotic Rate and Acceleration

Linear Convergence of Inexact Descent Method and Inexact Proximal Gradient Algorithms for Lower-Order Regularization Problems

Convergence Analysis of Asynchronous Stochastic Recursive Gradient Algorithms

High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise

Revisiting Convergence of AdaGrad with Relaxed Assumptions

On the Convergence of A Data-Driven Regularized Stochastic Gradient Descent for Nonlinear Ill-Posed Problems

A convergence analysis of the iteratively regularized Gauss-Newton method under Lipschitz condition

A Methodology Establishing Linear Convergence of Adaptive Gradient Methods under PL Inequality

Convergence Analysis of a Class of Nonsmooth Gradient Systems.

Convergence Analysis of Stochastic Gradient Descent with MCMC Estimators

Convex and Non-convex Optimization Under Generalized Smoothness

A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks