Abstract: Adaptive gradient methods, such as AdaGrad, are among the most successful optimization algorithms for neural network training. While these methods are known to achieve better dimensional dependence than stochastic gradient descent (SGD) under favorable geometry for stochastic convex optimization, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In fact, under standard assumptions of Lipschitz gradients and bounded noise variance, it is known that SGD is worst-case optimal (up to absolute constants) in terms of finding a near-stationary point with respect to the $\ell_2$-norm, making further improvements impossible. Motivated by this limitation, we introduce refined assumptions on the smoothness structure of the objective and the gradient noise variance, which better suit the coordinate-wise nature of adaptive gradient methods. Moreover, we adopt the $\ell_1$-norm of the gradient as the stationarity measure, as opposed to the standard $\ell_2$-norm, to align with the coordinate-wise analysis and obtain tighter convergence guarantees for AdaGrad. Under these new assumptions and the $\ell_1$-norm stationarity measure, we establish an upper bound on the convergence rate of AdaGrad and a corresponding lower bound for SGD. In particular, for certain configurations of problem parameters, we show that the iteration complexity of AdaGrad outperforms SGD by a factor of $d$. To the best of our knowledge, this is the first result to demonstrate a provable gain of adaptive gradient methods over SGD in a non-convex setting. We also present supporting lower bounds, including one specific to AdaGrad and one applicable to general deterministic first-order methods, showing that our upper bound for AdaGrad is tight and unimprovable up to a logarithmic factor under certain conditions.
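To make the coordinate-wise nature of AdaGrad and the two stationarity measures concrete, here is a minimal sketch in plain Python. It is an illustration only: the toy objective, the per-coordinate scales, the step size, the `eps` term, and all function names are assumptions chosen for this example and are not taken from the paper, whose exact AdaGrad variant and assumptions may differ. The sketch uses the common diagonal form of AdaGrad and relies on the standard inequality $\|g\|_2 \le \|g\|_1 \le \sqrt{d}\,\|g\|_2$ for $g \in \mathbb{R}^d$.

```python
import math

# Hypothetical per-coordinate curvature scales (assumption for illustration).
SCALES = [100.0, 1.0, 0.01]

def toy_grad(x):
    """Gradient of the toy non-convex objective
    f(x) = sum_i s_i * x_i^2 / (1 + x_i^2), i.e. s_i * 2 x_i / (1 + x_i^2)^2."""
    return [s * 2.0 * xi / (1.0 + xi * xi) ** 2 for s, xi in zip(SCALES, x)]

def sgd_step(x, grad, lr=0.1):
    """Plain gradient step: one shared step size for every coordinate."""
    return [xi - lr * g for xi, g in zip(x, grad)]

def adagrad_step(x, grad, accum, lr=0.1, eps=1e-8):
    """Diagonal AdaGrad step: each coordinate is divided by the square root
    of its own running sum of squared gradients."""
    new_accum = [a + g * g for a, g in zip(accum, grad)]
    new_x = [xi - lr * g / (math.sqrt(a) + eps)
             for xi, g, a in zip(x, grad, new_accum)]
    return new_x, new_accum

def l1_norm(v):
    return sum(abs(c) for c in v)

def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))

if __name__ == "__main__":
    x, accum = [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]
    for t in range(5):
        g = toy_grad(x)
        print(t, "l1 =", l1_norm(g), "l2 =", l2_norm(g))
        x, accum = adagrad_step(x, g, accum)
```

On the first step the accumulated sum equals the squared gradient, so every coordinate moves by roughly `lr` regardless of its scale, whereas the plain gradient step moves each coordinate in proportion to its gradient magnitude. This per-coordinate rescaling is the behavior that a coordinate-wise analysis and an $\ell_1$-norm stationarity measure are meant to capture.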

Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition

Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity

Improved Analysis of Clipping Algorithms for Non-convex Optimization

Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions

Adaptive Smoothing Gradient Learning for Spiking Neural Networks

AdaGrad under Anisotropic Smoothness

Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity

High Probability Analysis for Non-Convex Stochastic Optimization with Clipping

Stability and Convergence of Stochastic Gradient Clipping: Beyond Lipschitz Continuity and Smoothness

Regularized Gradient Clipping Provably Trains Wide and Deep Neural Networks

To Clip or not to Clip: the Dynamics of SGD with Gradient Clipping in High-Dimensions

Accelerated Gradient-free Neural Network Training by Multi-convex Alternating Optimization

Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees

Directional Smoothness and Gradient Methods: Convergence and Adaptivity

Convex and Non-convex Optimization Under Generalized Smoothness

Adaptive Gradient Regularization: A Faster and Generalizable Optimization Technique for Deep Neural Networks

On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization

From Gradient Clipping to Normalization for Heavy Tailed SGD

Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise

On Faster Convergence of Scaled Sign Gradient Descent

Gradient Methods with Online Scaling