Abstract: Adaptive gradient methods, such as AdaGrad, are among the most successful optimization algorithms for neural network training. While these methods are known to achieve better dimensional dependence than stochastic gradient descent (SGD) under favorable geometry for stochastic convex optimization, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In fact, under standard assumptions of Lipschitz gradients and bounded noise variance, it is known that SGD is worst-case optimal (up to absolute constants) in terms of finding a near-stationary point with respect to the $\ell_2$-norm, making further improvements impossible. Motivated by this limitation, we introduce refined assumptions on the smoothness structure of the objective and the gradient noise variance, which better suit the coordinate-wise nature of adaptive gradient methods. Moreover, we adopt the $\ell_1$-norm of the gradient as the stationarity measure, as opposed to the standard $\ell_2$-norm, to align with the coordinate-wise analysis and obtain tighter convergence guarantees for AdaGrad. Under these new assumptions and the $\ell_1$-norm stationarity measure, we establish an upper bound on the convergence rate of AdaGrad and a corresponding lower bound for SGD. In particular, for certain configurations of problem parameters, we show that the iteration complexity of AdaGrad outperforms SGD by a factor of $d$. To the best of our knowledge, this is the first result to demonstrate a provable gain of adaptive gradient methods over SGD in a non-convex setting. We also present supporting lower bounds, including one specific to AdaGrad and one applicable to general deterministic first-order methods, showing that our upper bound for AdaGrad is tight and unimprovable up to a logarithmic factor under certain conditions.
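To make the "coordinate-wise nature" of AdaGrad and the two stationarity measures concrete, the following is a minimal sketch and not the paper's algorithm statement or experimental setup. It uses the standard diagonal AdaGrad update in a deterministic (noise-free) setting. The quadratic test objective, the names `adagrad`, `quad_grad`, and `CURVATURES`, and the values of `lr`, `eps`, and `steps` are all illustrative assumptions.

```python
import math

def adagrad(grad, x0, steps=200, lr=0.5, eps=1e-8):
    """Diagonal AdaGrad: every coordinate gets its own effective step size."""
    x = list(x0)
    accum = [0.0] * len(x)  # running sum of squared gradients, per coordinate
    for _ in range(steps):
        g = grad(x)
        for i, gi in enumerate(g):
            accum[i] += gi * gi
            x[i] -= lr * gi / (math.sqrt(accum[i]) + eps)
    return x

def l1_norm(g):
    """ell_1 stationarity measure: sum of absolute gradient entries."""
    return sum(abs(gi) for gi in g)

def l2_norm(g):
    """Standard ell_2 stationarity measure."""
    return math.sqrt(sum(gi * gi for gi in g))

# Hypothetical objective f(x) = sum_i c_i * x_i^2 / 2 whose curvature
# differs by several orders of magnitude across coordinates.
CURVATURES = [100.0, 1.0, 0.01]

def quad_grad(x):
    return [c * xi for c, xi in zip(CURVATURES, x)]

x_final = adagrad(quad_grad, [1.0, 1.0, 1.0])
g_final = quad_grad(x_final)
print("l1 norm of final gradient:", l1_norm(g_final))
print("l2 norm of final gradient:", l2_norm(g_final))
```

On this toy problem the per-coordinate normalization cancels each curvature `c_i` (up to `eps`), so all coordinates make progress at essentially the same rate even though their curvatures differ by a factor of 10,000. For any vector in dimension $d$, the two measures satisfy $\|g\|_2 \le \|g\|_1 \le \sqrt{d}\,\|g\|_2$, so a bound stated in the $\ell_1$-norm is never weaker than the same bound in the $\ell_2$-norm.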

Convergence guarantees for forward gradient descent in the linear regression model

A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks

A proof of convergence for the gradient descent optimization method with random initializations in the training of neural networks with ReLU activation for piecewise linear target functions

Improving the Convergence Rates of Forward Gradient Descent with Repeated Sampling

On the Convergence of Gradient Descent for Large Learning Rates

A proof of convergence for gradient descent in the training of artificial neural networks for constant target functions

Convergence proof for stochastic gradient descent in the training of deep neural networks with ReLU activation for constant target functions

Linear Convergence Rate in Convex Setup is Possible! Gradient Descent Method Variants under $(L_0,L_1)$-Smoothness

Convergence Analysis of Gradient Algorithms on Riemannian Manifolds Without Curvature Constraints and Application to Riemannian Mass

Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions

Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model

Linear convergence of forward-backward accelerated algorithms without knowledge of the modulus of strong convexity

A Geometric Approach of Gradient Descent Algorithms in Linear Neural Networks

Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth

Linear Convergence of Adaptive Stochastic Gradient Descent

Linear Convergence of Stochastic Iterative Greedy Algorithms With Sparse Constraints

Aiming towards the minimizers: fast convergence of SGD for overparametrized problems

Convergence of online gradient method for feedforward neural networks with smoothing $L_{1/2}$ regularization penalty

Convergence of Batch Gradient Learning with Smoothing Regularization and Adaptive Momentum for Neural Networks