Stability and convergence analysis of AdaGrad for non-convex optimization via novel stopping time-based techniques

Ruinan Jin, Xiaoyu Wang, Baoxiang Wang
2024-12-29
Abstract: Adaptive gradient optimizers (AdaGrad), which dynamically adjust the learning rate based on iterative gradients, have emerged as powerful tools in deep learning. These adaptive methods have significantly succeeded in various deep learning tasks, outperforming stochastic gradient descent. However, despite AdaGrad's status as a cornerstone of adaptive optimization, its theoretical analysis has not adequately addressed key aspects such as asymptotic convergence and non-asymptotic convergence rates in non-convex optimization scenarios. This study aims to provide a comprehensive analysis of AdaGrad and bridge the existing gaps in the literature. We introduce a new stopping time technique from probability theory, which allows us to establish the stability of AdaGrad under mild conditions. We further derive the asymptotically almost sure and mean-square convergence for AdaGrad. In addition, we demonstrate the near-optimal non-asymptotic convergence rate measured by the average-squared gradients in expectation, which is stronger than the existing high-probability results. The techniques developed in this work are potentially of independent interest for future research on other adaptive stochastic algorithms.
Optimization and Control, Machine Learning
### What problem does this paper attempt to solve?

This paper addresses the incomplete theoretical analysis of the AdaGrad optimization algorithm in non-convex optimization, especially with respect to asymptotic convergence and non-asymptotic convergence rates. Specifically:

1. **Asymptotic convergence**:
   - **Almost sure convergence**: Prove that, in non-convex optimization, the gradient norm along the AdaGrad iterates vanishes almost surely, that is, \(\lim_{n \to \infty} \|\nabla g(\theta_n)\| = 0\) with probability one.
   - **Mean-square convergence**: Prove that the expected squared gradient norm along the AdaGrad iterates approaches zero, that is, \(\lim_{n \to \infty} E[\|\nabla g(\theta_n)\|^2] = 0\).
2. **Non-asymptotic convergence rate**:
   - Provide a non-asymptotic rate for the expected average squared gradient without assuming that the stochastic gradients are uniformly bounded. Specifically, the authors show that under certain conditions the rate for AdaGrad is \(O\left(\frac{\ln T}{\sqrt{T}}\right)\).
3. **Stability**:
   - Introduce a new stopping-time technique to prove that AdaGrad is stable under mild conditions, that is, the expected supremum of the loss along the iterates is bounded: \(E\left[\sup_{n \geq 1} g(\theta_n)\right] < \tilde{M} < +\infty\).
4. **Improvements over existing work**:
   - The paper addresses limitations in the existing literature. For example, the analysis in Jin et al. [2022] depends on an unrealistic no-saddle-point assumption, and the results of Li and Orabona [2019] apply only to a modified AdaGrad variant. This paper analyzes the original AdaGrad and does not require these strong assumptions.

Through these analyses, the paper provides more comprehensive and rigorous theoretical guarantees for AdaGrad in non-convex optimization, filling gaps in the existing literature.
In addition, the techniques developed in the paper may be of independent interest and could be applied to the analysis of other adaptive stochastic optimization algorithms.
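
To make the three quantities in the summary concrete, the following is a minimal sketch, not the paper's algorithm statement or experiments. It assumes the scalar-step-size ("norm") form of AdaGrad, a hypothetical toy objective \(g(\theta) = \sum_i (\theta_i^2 + 3\sin^2\theta_i)\), and Gaussian gradient noise. The names `adagrad_norm`, `alpha`, and `noise_std` are illustrative choices. The function tracks the final gradient norm, the average squared gradient over \(T\) steps, and the running supremum of the loss along one sample path.

```python
import math
import random


def g(theta):
    """Toy non-convex objective (an assumption, not taken from the paper)."""
    return sum(x * x + 3.0 * math.sin(x) ** 2 for x in theta)


def grad_g(theta):
    """Exact gradient of g: d/dx [x^2 + 3 sin^2 x] = 2x + 3 sin(2x)."""
    return [2.0 * x + 3.0 * math.sin(2.0 * x) for x in theta]


def sq_norm(v):
    """Squared Euclidean norm of a list of floats."""
    return sum(x * x for x in v)


def adagrad_norm(theta0, steps=2000, alpha=0.5, noise_std=0.1, seed=0):
    """Run a scalar-step-size AdaGrad sketch on g with noisy gradients.

    Update (assumed form): S_n = S_{n-1} + ||noisy_grad_n||^2,
    theta_{n+1} = theta_n - alpha / sqrt(S_n) * noisy_grad_n.
    """
    rng = random.Random(seed)
    theta = list(theta0)
    s = 0.0                # running sum of squared stochastic-gradient norms
    sup_loss = g(theta)    # running supremum of the loss on this sample path
    sum_sq_grad = 0.0      # accumulates ||grad g(theta_n)||^2 over the run

    for _ in range(steps):
        true_grad = grad_g(theta)
        sum_sq_grad += sq_norm(true_grad)

        # Stochastic gradient: exact gradient plus zero-mean Gaussian noise.
        noisy_grad = [d + rng.gauss(0.0, noise_std) for d in true_grad]

        s += sq_norm(noisy_grad)
        step = alpha / math.sqrt(max(s, 1e-12))  # guard against s == 0
        theta = [x - step * d for x, d in zip(theta, noisy_grad)]

        sup_loss = max(sup_loss, g(theta))

    return {
        "theta": theta,
        "final_grad_norm": math.sqrt(sq_norm(grad_g(theta))),
        "avg_sq_grad": sum_sq_grad / steps,
        "sup_loss": sup_loss,
    }


if __name__ == "__main__":
    result = adagrad_norm([2.0, -1.5])
    for key, value in result.items():
        print(key, value)
```

A single run only illustrates the quantities. The paper's statements concern expectations and almost-sure limits, which a simulation can approximate by averaging over many seeds but cannot prove.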