Abstract: Adaptive gradient methods, such as AdaGrad, are among the most successful optimization algorithms for neural network training. While these methods are known to achieve better dimensional dependence than stochastic gradient descent (SGD) under favorable geometry for stochastic convex optimization, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In fact, under standard assumptions of Lipschitz gradients and bounded noise variance, it is known that SGD is worst-case optimal (up to absolute constants) in terms of finding a near-stationary point with respect to the $\ell_2$-norm, making further improvements impossible. Motivated by this limitation, we introduce refined assumptions on the smoothness structure of the objective and the gradient noise variance, which better suit the coordinate-wise nature of adaptive gradient methods. Moreover, we adopt the $\ell_1$-norm of the gradient as the stationarity measure, as opposed to the standard $\ell_2$-norm, to align with the coordinate-wise analysis and obtain tighter convergence guarantees for AdaGrad. Under these new assumptions and the $\ell_1$-norm stationarity measure, we establish an upper bound on the convergence rate of AdaGrad and a corresponding lower bound for SGD. In particular, for certain configurations of problem parameters, we show that the iteration complexity of AdaGrad outperforms SGD by a factor of $d$. To the best of our knowledge, this is the first result to demonstrate a provable gain of adaptive gradient methods over SGD in a non-convex setting. We also present supporting lower bounds, including one specific to AdaGrad and one applicable to general deterministic first-order methods, showing that our upper bound for AdaGrad is tight and unimprovable up to a logarithmic factor under certain conditions.
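To make the coordinate-wise update and the two stationarity measures concrete, here is a minimal sketch in plain Python. It assumes the textbook diagonal form of AdaGrad; the step size `lr`, the stabilizer `eps`, and the function names are illustrative choices, not taken from the paper, whose exact variant may differ.

```python
import math

def adagrad_step(x, g, accum, lr=0.1, eps=1e-8):
    """One diagonal (coordinate-wise) AdaGrad update.

    x, g, accum are lists of equal length d. Each coordinate keeps its own
    running sum of squared gradients, so each gets its own effective step size.
    lr and eps are assumed, illustrative values.
    """
    new_accum = [a + gi * gi for a, gi in zip(accum, g)]
    new_x = [xi - lr * gi / (math.sqrt(a) + eps)
             for xi, gi, a in zip(x, g, new_accum)]
    return new_x, new_accum

def l1_norm(g):
    """l1 stationarity measure: sum of absolute gradient coordinates."""
    return sum(abs(gi) for gi in g)

def l2_norm(g):
    """Standard l2 stationarity measure: Euclidean norm of the gradient."""
    return math.sqrt(sum(gi * gi for gi in g))

# Toy gradient in d = 2 dimensions (hypothetical values).
g = [3.0, -4.0]
x, accum = adagrad_step([0.0, 0.0], g, [0.0, 0.0])
```

For any vector in $d$ dimensions, $\|g\|_2 \le \|g\|_1 \le \sqrt{d}\,\|g\|_2$, so a bound stated in the $\ell_1$-norm is at least as strong as the same bound in the $\ell_2$-norm. On the first step the accumulator equals the squared gradient, so every coordinate moves by roughly `lr` regardless of its gradient magnitude.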

On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm

Convergence of the Iterates for Momentum and RMSProp for Local Smooth Functions: Adaptation is the Key

Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance

Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight

Adaptive Learning Rates with Maximum Variation Averaging.

The Anytime Convergence of Stochastic Gradient Descent with Momentum: From a Continuous-Time Perspective

Random Reshuffling with Momentum for Nonconvex Problems: Iteration Complexity and Last Iterate Convergence

On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization

Adaptive Gradient Methods with Dynamic Bound of Learning Rate.

Convergence Rate Analysis for Deep Ritz Method

A Qualitative Study of the Dynamic Behavior for Adaptive Gradient Algorithms

Almost sure convergence rates of stochastic gradient methods under gradient domination

Efficient Adaptive Optimization via Subset-Norm and Subspace-Momentum: Fast, Memory-Reduced Training with Convergence Guarantees

Stable Gradient-Adjusted Root Mean Square Propagation on Least Squares Problem

Stochastic gradient descent algorithms for strongly convex functions at O(1/T) convergence rates

Convergence Rate Analysis of LION

On the Convergence of Memory-Based Distributed SGD.

Gradient Temporal Difference with Momentum: Stability and Convergence

On Convergence of Adam for Stochastic Optimization under Relaxed Assumptions

Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions