Abstract: Adaptive gradient methods, such as AdaGrad, are among the most successful optimization algorithms for neural network training. While these methods are known to achieve better dimensional dependence than stochastic gradient descent (SGD) under favorable geometry for stochastic convex optimization, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In fact, under standard assumptions of Lipschitz gradients and bounded noise variance, it is known that SGD is worst-case optimal (up to absolute constants) in terms of finding a near-stationary point with respect to the $\ell_2$-norm, making further improvements impossible. Motivated by this limitation, we introduce refined assumptions on the smoothness structure of the objective and the gradient noise variance, which better suit the coordinate-wise nature of adaptive gradient methods. Moreover, we adopt the $\ell_1$-norm of the gradient as the stationarity measure, as opposed to the standard $\ell_2$-norm, to align with the coordinate-wise analysis and obtain tighter convergence guarantees for AdaGrad. Under these new assumptions and the $\ell_1$-norm stationarity measure, we establish an upper bound on the convergence rate of AdaGrad and a corresponding lower bound for SGD. In particular, for certain configurations of problem parameters, we show that the iteration complexity of AdaGrad improves on that of SGD by a factor of $d$. To the best of our knowledge, this is the first result to demonstrate a provable gain of adaptive gradient methods over SGD in a non-convex setting. We also present supporting lower bounds, including one specific to AdaGrad and one applicable to general deterministic first-order methods, showing that our upper bound for AdaGrad is tight and unimprovable up to a logarithmic factor under certain conditions.
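To make the coordinate-wise update and the two stationarity measures concrete, here is a minimal illustrative sketch in plain Python. It is not the paper's algorithm statement or experimental setup. The toy objective $f(x) = \sum_i c_i x_i^2 / (1 + x_i^2)$, the scales `c`, the step size `lr`, the stabilizer `eps`, and the noise-free gradient are all assumptions chosen for illustration. The sketch runs the textbook diagonal AdaGrad update and reports the gradient's $\ell_1$- and $\ell_2$-norms, which always satisfy $\|g\|_2 \le \|g\|_1 \le \sqrt{d}\,\|g\|_2$.

```python
import math

def l1_norm(g):
    """Stationarity measure adopted in the abstract: sum of absolute gradient entries."""
    return sum(abs(gi) for gi in g)

def l2_norm(g):
    """Standard Euclidean stationarity measure."""
    return math.sqrt(sum(gi * gi for gi in g))

def toy_grad(x, c):
    """Gradient of the assumed toy non-convex objective sum_i c_i * x_i^2 / (1 + x_i^2)."""
    return [2.0 * ci * xi / (1.0 + xi * xi) ** 2 for xi, ci in zip(x, c)]

def adagrad_step(x, g, accum, lr=0.5, eps=1e-8):
    """One diagonal AdaGrad step.

    accum[i] is the running sum of squared gradients for coordinate i, so each
    coordinate gets its own effective step size lr / sqrt(accum[i]).
    """
    new_accum = [a + gi * gi for a, gi in zip(accum, g)]
    new_x = [xi - lr * gi / (math.sqrt(ai) + eps)
             for xi, gi, ai in zip(x, g, new_accum)]
    return new_x, new_accum

def run_adagrad(x0, c, steps=200, lr=0.5):
    """Run noise-free AdaGrad and return (initial, final) l1-norms of the gradient."""
    x, accum = list(x0), [0.0] * len(x0)
    before = l1_norm(toy_grad(x, c))
    for _ in range(steps):
        x, accum = adagrad_step(x, toy_grad(x, c), accum, lr=lr)
    return before, l1_norm(toy_grad(x, c))

if __name__ == "__main__":
    # Hypothetical starting point and badly scaled coordinates.
    before, after = run_adagrad([1.0, -2.0, 0.5], [1.0, 10.0, 0.1])
    print("l1 gradient norm before:", before, "after:", after)
```

Because each coordinate is normalized by its own accumulated gradient magnitude, the coordinates with scales 10 and 0.1 take steps of comparable size, which is the coordinate-wise behavior the refined assumptions are meant to capture.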

Adaptive Stochastic Gradient Langevin Dynamics: Taming Convergence and Saddle Point Escape Time

Langevin Dynamics: A Unified Perspective on Optimization via Lyapunov Potentials

Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization

Stability and convergence analysis of AdaGrad for non-convex optimization via novel stopping time-based techniques

Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions

An SDE Perspective on Stochastic Inertial Gradient Dynamics with Time-Dependent Viscosity and Geometric Damping

Convergence Error Analysis of Reflected Gradient Langevin Dynamics for Globally Optimizing Non-Convex Constrained Problems

An Algebraically Converging Stochastic Gradient Descent Algorithm for Global Optimization

Universal Gradient Descent Ascent Method for Nonconvex-Nonconcave Minimax Optimization

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

On the Sublinear Convergence of Randomly Perturbed Alternating Gradient Descent to Second Order Stationary Solutions

Stochastic Approximate Gradient Descent via the Langevin Algorithm

Sharp Analysis for Nonconvex SGD Escaping from Saddle Points

Dissipative Gradient Descent Ascent Method: A Control Theory Inspired Algorithm for Min-max Optimization

A Single-Loop Smoothed Gradient Descent-Ascent Algorithm for Nonconvex-Concave Min-Max Problems

Escape Saddle Points by a Simple Gradient-Descent Based Algorithm

Low-Precision Stochastic Gradient Langevin Dynamics

Hitting Time of Stochastic Gradient Langevin Dynamics to Stationary Points: A Direct Analysis

Faster single-loop algorithms for minimax optimization without strong concavity

Adaptive Non-reversible Stochastic Gradient Langevin Dynamics

TiAda: A Time-scale Adaptive Algorithm for Nonconvex Minimax Optimization