Abstract: Adam-type algorithms have become a preferred choice for optimisation in the deep learning setting; however, despite their success, their convergence is still not well understood. To this end, we introduce a unified framework for Adam-type algorithms (called UAdam). It is equipped with a general form of the second-order moment, which makes it possible to include Adam and its variants, such as NAdam, AMSGrad, AdaBound, AdaFom, and Adan, as special cases. The framework is supported by a rigorous convergence analysis of UAdam in the non-convex stochastic setting, showing that UAdam converges to a neighborhood of stationary points at a rate of $\mathcal{O}(1/T)$. Furthermore, the size of the neighborhood decreases as $\beta$ increases. Importantly, our analysis only requires the first-order momentum factor to be close enough to 1, without any restrictions on the second-order momentum factor. The theoretical results also show that vanilla Adam can converge when appropriate hyperparameters are selected, which provides a theoretical guarantee for the analysis, applications, and further development of the whole class of Adam-type algorithms.

What problem does this paper attempt to address?

The problem this paper addresses is the convergence of Adam and its variants in non-convex stochastic optimization. Although Adam has achieved remarkable success in deep learning, its convergence is still not fully understood. Specifically:

1. **Convergence problem of the Adam algorithm**: Although Adam performs well in practice, Reddi et al. pointed out that it can fail to converge even on simple convex problems. In response, researchers have proposed many variants of Adam (such as AMSGrad and AdaBound). These variants differ only in the second-order moment, yet there is no unified theoretical framework that explains their convergence.
2. **Limitations of existing analyses**: Most previous work focused on the online convex setting and cannot explain convergence behavior in the non-convex setting that is common in practice. In addition, many analyses impose strict requirements on the second-order momentum parameter $\beta_2$ that are inconsistent with the hyperparameter settings used in practical applications.

To solve these problems, the paper proposes a unified Adam-type algorithm framework (UAdam) that aims at:

- **Providing a general framework**: UAdam includes Adam and its various variants as special cases, thereby providing a single theoretical analysis platform for these algorithms.
- **Relaxing the restrictions on the second-order momentum parameter**: The paper proves that UAdam can converge without any restriction on the second-order momentum parameter $\beta_2$, as long as the first-order momentum parameter $\beta_1$ is close enough to 1.
- **Proving the convergence rate**: The paper proves that, in the non-convex stochastic optimization setting, UAdam converges to a neighborhood of stationary points at a rate of $O(1/T)$, and that the size of this neighborhood decreases as $\beta$ increases.

Through these contributions, the paper not only offers a new perspective on the convergence of Adam and its variants, but also provides a theoretical basis for developing new optimization algorithms in the future.
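The idea that Adam-type methods share one update and differ only in the second-order moment can be illustrated with a minimal sketch. This is not the paper's UAdam algorithm, whose exact form is not given here. It is an assumed, simplified Adam-style loop without bias correction, in which the second-moment rule is a pluggable object. The class names `AdamMoment` and `AMSGradMoment`, the function `adam_type_minimize`, and the test gradient `quadratic_grad` are all hypothetical names chosen for illustration.

```python
import math


class AdamMoment:
    """Adam's rule: exponential moving average of squared gradients."""

    def __init__(self, dim, beta2=0.999):
        self.beta2 = beta2
        self.v = [0.0] * dim

    def update(self, grad):
        self.v = [self.beta2 * v + (1.0 - self.beta2) * g * g
                  for v, g in zip(self.v, grad)]
        return self.v


class AMSGradMoment(AdamMoment):
    """AMSGrad's rule: running element-wise maximum of Adam's average."""

    def __init__(self, dim, beta2=0.999):
        super().__init__(dim, beta2)
        self.v_max = [0.0] * dim

    def update(self, grad):
        v = super().update(grad)
        self.v_max = [max(a, b) for a, b in zip(self.v_max, v)]
        return self.v_max


def adam_type_minimize(grad_fn, x0, moment, lr=0.01, beta1=0.9,
                       eps=1e-8, steps=2000):
    """Shared Adam-type loop; only `moment` changes between variants.

    Simplification (an assumption of this sketch): no bias correction.
    """
    x = list(x0)
    m = [0.0] * len(x)
    for _ in range(steps):
        g = grad_fn(x)
        # First-order moment: moving average of the gradients.
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        # Second-order moment: delegated to the chosen variant.
        v = moment.update(g)
        x = [xi - lr * mi / (math.sqrt(vi) + eps)
             for xi, mi, vi in zip(x, m, v)]
    return x


def quadratic_grad(x):
    """Gradient of the toy objective f(x) = sum((x_i - 1)^2)."""
    return [2.0 * (xi - 1.0) for xi in x]
```

Calling `adam_type_minimize(quadratic_grad, [3.0, -2.0], AdamMoment(2))` and the same call with `AMSGradMoment(2)` runs the two variants through identical code, which is the structural point a unified framework relies on. A deterministic quadratic is used only to keep the sketch short; the paper's analysis concerns the non-convex stochastic setting.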

UAdam: Unified Adam-Type Algorithmic Framework for Non-Convex Stochastic Optimization
