Abstract: Adaptive gradient methods, such as AdaGrad, are among the most successful optimization algorithms for neural network training. While these methods are known to achieve better dimensional dependence than stochastic gradient descent (SGD) under favorable geometry for stochastic convex optimization, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In fact, under standard assumptions of Lipschitz gradients and bounded noise variance, it is known that SGD is worst-case optimal (up to absolute constants) in terms of finding a near-stationary point with respect to the $\ell_2$-norm, making further improvements impossible. Motivated by this limitation, we introduce refined assumptions on the smoothness structure of the objective and the gradient noise variance, which better suit the coordinate-wise nature of adaptive gradient methods. Moreover, we adopt the $\ell_1$-norm of the gradient as the stationarity measure, as opposed to the standard $\ell_2$-norm, to align with the coordinate-wise analysis and obtain tighter convergence guarantees for AdaGrad. Under these new assumptions and the $\ell_1$-norm stationarity measure, we establish an upper bound on the convergence rate of AdaGrad and a corresponding lower bound for SGD. In particular, for certain configurations of problem parameters, we show that the iteration complexity of AdaGrad outperforms SGD by a factor of $d$. To the best of our knowledge, this is the first result to demonstrate a provable gain of adaptive gradient methods over SGD in a non-convex setting. We also present supporting lower bounds, including one specific to AdaGrad and one applicable to general deterministic first-order methods, showing that our upper bound for AdaGrad is tight and unimprovable up to a logarithmic factor under certain conditions.

A Sharp Convergence Rate for the Asynchronous Stochastic Gradient Descent

Convergence Analysis of Asynchronous Stochastic Recursive Gradient Algorithms

AsGrad: A Sharp Unified Analysis of Asynchronous-SGD Algorithms

Asynchronous Accelerated Stochastic Gradient Descent

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach With Convergence Guarantee

Sharp Analysis for Nonconvex SGD Escaping from Saddle Points

Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization

A Sharp Estimate on the Transient Time of Distributed Stochastic Gradient Descent

A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates

Parallel Asynchronous Stochastic Variance Reduction for Nonconvex Optimization

Enhancing Stochastic Gradient Descent: A Unified Framework and Novel Acceleration Methods for Faster Convergence

Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework

Asynchronous Decentralized Accelerated Stochastic Gradient Descent

Fast Asynchronous Parallel Stochastic Gradient Decent

Accelerated stochastic approximation with state-dependent noise

Asynchronous Stochastic Proximal Methods for Nonconvex Nonsmooth Optimization

On the Convergence Properties of a K-step Averaging Stochastic Gradient Descent Algorithm for Nonconvex Optimization

Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions

Convergence Rates for Stochastic Approximation: Biased Noise with Unbounded Variance, and Applications

Asynchronous Stochastic Gradient Descent over Decentralized Datasets

Convergence Analysis of Stochastic Gradient Descent with MCMC Estimators