Abstract: Adaptive gradient optimizers (AdaGrad), which dynamically adjust the learning rate based on iterative gradients, have emerged as powerful tools in deep learning. These adaptive methods have achieved significant success in various deep learning tasks, outperforming stochastic gradient descent. However, despite AdaGrad's status as a cornerstone of adaptive optimization, its theoretical analysis has not adequately addressed key aspects such as asymptotic convergence and non-asymptotic convergence rates in non-convex optimization scenarios. This study aims to provide a comprehensive analysis of AdaGrad and bridge the existing gaps in the literature. We introduce a new stopping time technique from probability theory, which allows us to establish the stability of AdaGrad under mild conditions. We further derive the asymptotic almost sure and mean-square convergence of AdaGrad. In addition, we demonstrate the near-optimal non-asymptotic convergence rate measured by the average-squared gradients in expectation, which is stronger than the existing high-probability results. The techniques developed in this work are potentially of independent interest for future research on other adaptive stochastic algorithms.


### What problem does this paper attempt to solve?

This paper addresses the insufficient theoretical analysis of the AdaGrad optimization algorithm in non-convex optimization, especially regarding asymptotic convergence and non-asymptotic convergence rates. Specifically:

1. **Asymptotic convergence**:
   - **Almost sure convergence**: Prove that AdaGrad converges almost surely to a critical point in non-convex optimization, that is, \(\lim_{n \to \infty} \|\nabla g(\theta_n)\| = 0\) holds almost surely.
   - **Mean-square convergence**: Prove that the expected squared gradient norm of AdaGrad approaches zero, that is, \(\lim_{n \to \infty} E[\|\nabla g(\theta_n)\|^2] = 0\).
2. **Non-asymptotic convergence rate**:
   - Provide a non-asymptotic convergence rate for the expected average squared gradient without relying on the assumption that the stochastic gradient is uniformly bounded. Specifically, the authors show that, under certain conditions, the non-asymptotic convergence rate of AdaGrad is \(O\left(\frac{\ln T}{\sqrt{T}}\right)\).
3. **Stability**:
   - Introduce a new stopping-time technique to prove that AdaGrad is stable under mild conditions, that is, the expected supremum of the loss function along the iterates is bounded: \(E\left[\sup_{n \geq 1} g(\theta_n)\right] < \tilde{M} < +\infty\).
4. **Improvements over existing work**:
   - The paper addresses some limitations in the existing literature. For example, the analysis in Jin et al. [2022] depends on the unrealistic no-saddle-point assumption, and the results of Li and Orabona [2019] apply only to a modified AdaGrad variant. This paper analyzes the original AdaGrad and does not require these strong assumptions.

Through these analyses, the paper provides a more comprehensive and rigorous theoretical guarantee for the performance of AdaGrad in non-convex optimization, filling gaps in the existing literature. In addition, the techniques proposed in the paper are of independent interest and can be applied to the study of other adaptive stochastic optimization algorithms.
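To make the quantities in the summary concrete, here is a minimal Python sketch of a coordinate-wise AdaGrad loop that records the squared gradient norm \(\|\nabla g(\theta_n)\|^2\) at each iterate. Everything specific in it is an assumption chosen for illustration and is not taken from the paper: the toy non-convex objective \(g(\theta) = \sum_i \theta_i^2 / (1 + \theta_i^2)\), the additive Gaussian gradient noise, the coordinate-wise form of the update, and the names `grad_g`, `adagrad`, `lr`, `noise_std`, and `eps`.

```python
import math
import random


def grad_g(theta):
    """Gradient of the toy objective g(theta) = sum(x^2 / (1 + x^2)) (an assumption)."""
    return [2.0 * x / (1.0 + x * x) ** 2 for x in theta]


def adagrad(theta0, steps, lr=0.5, noise_std=0.1, eps=1e-8, seed=0):
    """Run coordinate-wise AdaGrad with noisy gradients.

    Returns the final iterate and the list of squared true-gradient norms,
    one per iterate visited.
    """
    rng = random.Random(seed)
    theta = [float(x) for x in theta0]
    accum = [0.0] * len(theta)  # running sum of squared stochastic gradients
    sq_grad_norms = []
    for _ in range(steps):
        true_grad = grad_g(theta)
        sq_grad_norms.append(sum(gi * gi for gi in true_grad))
        for i, gi in enumerate(true_grad):
            # Hypothetical noise model: true gradient plus Gaussian noise.
            noisy = gi + rng.gauss(0.0, noise_std)
            accum[i] += noisy * noisy
            # Step size shrinks as the accumulated squared gradients grow.
            theta[i] -= lr * noisy / (math.sqrt(accum[i]) + eps)
    return theta, sq_grad_norms


if __name__ == "__main__":
    theta, sq = adagrad([1.0, -2.0], steps=2000)
    print("average squared gradient norm:", sum(sq) / len(sq))
    print("final squared gradient norm:", sq[-1])
```

The mean of the returned list for a single run is an empirical stand-in for the average squared gradient whose expectation the paper bounds by \(O\left(\frac{\ln T}{\sqrt{T}}\right)\). Its last entries indicate whether the gradient norm is approaching zero along that particular trajectory.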

Stability and convergence analysis of AdaGrad for non-convex optimization via novel stopping time-based techniques

On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization

Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions

On the Convergence of AdaGrad(Norm) on $\R^{d}$: Beyond Convexity, Non-Asymptotic Rate and Acceleration

Revisiting Convergence of AdaGrad with Relaxed Assumptions

Convergence of AdaGrad for Non-convex Objectives: Simple Proofs and Relaxed Assumptions

Linear Convergence of Adaptive Stochastic Gradient Descent

A Comprehensive Framework for Analyzing the Convergence of Adam: Bridging the Gap with SGD

Adaptive Gradient Methods with Dynamic Bound of Learning Rate.

Convergence Analysis of Asynchronous Stochastic Recursive Gradient Algorithms

Universal Stagewise Learning for Non-Convex Problems with Convergence on Averaged Solutions

Adaptive Gradient Methods Can Be Provably Faster Than SGD after Finite Epochs

A Unified Analysis of AdaGrad With Weighted Aggregation and Momentum Acceleration

AdaGrad under Anisotropic Smoothness

Universality of AdaGrad Stepsizes for Stochastic Optimization: Inexact Oracle, Acceleration and Variance Reduction

Convergence rates for the Adam optimizer

Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates

Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent

Enhancing Stochastic Gradient Descent: A Unified Framework and Novel Acceleration Methods for Faster Convergence

Understanding the unstable convergence of gradient descent.

On Convergence of Adam for Stochastic Optimization under Relaxed Assumptions