Abstract: Adam has been shown to fail to converge to the optimal solution in certain cases. Researchers have recently proposed several algorithms to avoid the non-convergence of Adam, but their efficiency turns out to be unsatisfactory in practice. In this paper, we provide new insight into the non-convergence issue of Adam as well as other adaptive learning rate methods. We argue that there exists an inappropriate correlation between the gradient $g_t$ and the second-moment term $v_t$ in Adam ($t$ is the timestep), so that a large gradient is likely to have a small step size while a small gradient may have a large step size. We demonstrate that such biased step sizes are the fundamental cause of the non-convergence of Adam, and we further prove that decorrelating $v_t$ and $g_t$ leads to an unbiased step size for each gradient, thus solving the non-convergence problem of Adam. Finally, we propose AdaShift, a novel adaptive learning rate method that decorrelates $v_t$ and $g_t$ by temporal shifting, i.e., using the temporally shifted gradient $g_{t-n}$ to calculate $v_t$. The experimental results demonstrate that AdaShift is able to address the non-convergence issue of Adam while still maintaining performance competitive with Adam in terms of both training speed and generalization.

What problem does this paper attempt to address?

This paper addresses the failure of the Adam optimization algorithm to converge to the optimal solution in some cases. Specifically, there is an inappropriate positive correlation between the gradient $g_t$ and the second-moment term $v_t$ in Adam, so that large gradients tend to receive smaller step sizes while small gradients tend to receive larger step sizes. These biased step sizes are the fundamental cause of the non-convergence of Adam.

### Main contributions of the paper

1. **Problem analysis**:
   - By analyzing the cumulative step size, the authors found that common adaptive learning rate methods bias the step size against the magnitude of the gradient $g_t$: large gradients often receive smaller step sizes, while small gradients receive larger step sizes.
   - This bias stems from the inappropriate positive correlation between $v_t$ and $g_t$. When the gradient is large, $v_t$ is also large, which yields a smaller step size, and vice versa.
2. **Solution**:
   - The authors propose a new adaptive learning rate method called AdaShift, which breaks the correlation between $v_t$ and $g_t$ through temporal shifting. Specifically, the gradient $g_{t-n}$ from time $t-n$ is used to calculate $v_t$, rather than the current gradient $g_t$.
   - AdaShift addresses the non-convergence problem of Adam while remaining competitive with it in training speed and generalization performance.
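The step-size bias described in the problem analysis can be illustrated with a small simulation. This is only a sketch under stated assumptions: the gradient distribution (a large value `C` with probability `p`, otherwise `-1`), the values of `C`, `p`, and `beta2`, and the helper name `mean_step_sizes` are illustrative choices, not taken from the paper. The code compares the average step size $1/\sqrt{v_t}$ received by large and by small gradients when $v_t$ uses the current gradient (as in Adam) and when it uses the previous gradient (a shift of one step).

```python
import random


def mean_step_sizes(shifted, steps=200_000, C=10.0, p=0.1, beta2=0.9, seed=0):
    """Return (mean step for large gradients, mean step for small gradients).

    shifted=False: v_t is updated with g_t, as in Adam.
    shifted=True:  v_t is updated with g_{t-1}, a temporal shift of one step.
    The i.i.d. gradient model below is an illustrative assumption.
    """
    rng = random.Random(seed)
    v = 1.0          # start at 1 so sqrt(v) is never zero (every g^2 >= 1 here)
    prev_g = -1.0    # stand-in for the gradient before the first step
    sums = {"large": 0.0, "small": 0.0}
    counts = {"large": 0, "small": 0}
    for _ in range(steps):
        g = C if rng.random() < p else -1.0
        source = prev_g if shifted else g
        v = beta2 * v + (1 - beta2) * source * source
        key = "large" if g == C else "small"
        sums[key] += 1.0 / v ** 0.5   # step size applied to the current gradient
        counts[key] += 1
        prev_g = g
    return sums["large"] / counts["large"], sums["small"] / counts["small"]


adam_large, adam_small = mean_step_sizes(shifted=False)
shift_large, shift_small = mean_step_sizes(shifted=True)
print("Adam-style v_t:  large", adam_large, "small", adam_small)
print("Shifted v_t:     large", shift_large, "small", shift_small)
```

Under this toy model, the Adam-style run should give large gradients a noticeably smaller average step than small gradients, because each large $g_t$ inflates its own $v_t$. With the shifted $v_t$, the step size is independent of the current gradient, so the two averages should be close.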
### Formulas

- The update rules in the Adam algorithm are:

  \[ m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t \]
  \[ v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2 \]
  \[ \theta_{t+1}=\theta_t-\frac{\alpha_t}{\sqrt{v_t}}m_t \]

- The update rules in the AdaShift algorithm are:

  \[ v_t[i]=\beta_2 v_{t-1}[i]+(1-\beta_2)\varphi(g_{t-n}^2[i]) \]
  \[ m_t=\frac{\sum_{j=0}^{n-1}\beta_1^j g_{t-j}}{\sum_{j=0}^{n-1}\beta_1^j} \]
  \[ \theta_{t+1}[i]=\theta_t[i]-\frac{\alpha_t}{\sqrt{v_t[i]}}m_t[i] \]

  where $[i]$ selects the $i$-th block of a high-dimensional gradient or parameter and $\varphi$ is a spatial operation function used to process the spatial information of high-dimensional gradients.

### Summary

Starting from an analysis of the non-convergence problem of Adam, this paper proposes AdaShift. AdaShift breaks the correlation between the gradient and the second-moment term through temporal shifting, which addresses the non-convergence of Adam, and the paper's experiments show that it remains competitive with Adam in training speed and generalization.
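The Adam and AdaShift update rules given above can be sketched in Python for a single scalar parameter. This is a minimal illustration rather than the authors' implementation: the names `adam_step` and `AdaShiftScalar`, the small constant `eps` in the denominator, the warm-up behavior before $n$ past gradients are available, and the identity default for $\varphi$ are all assumptions made here. Each update starts from the current parameter value, and bias correction is left out because the formulas above do not include it.

```python
import math
from collections import deque


def adam_step(theta, g, m, v, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update following the formulas above (no bias correction)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g   # v_t depends on the current g_t
    theta = theta - lr * m / (math.sqrt(v) + eps)  # eps is an added assumption
    return theta, m, v


class AdaShiftScalar:
    """Scalar sketch: v_t is built from g_{t-n}, m_t from the n newest gradients."""

    def __init__(self, n=10, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8,
                 phi=lambda x: x):
        self.n, self.lr, self.eps = n, lr, eps
        self.beta1, self.beta2 = beta1, beta2
        self.phi = phi           # spatial operation; identity is an assumption
        self.window = deque()    # the n most recent gradients, oldest first
        self.v = 0.0

    def step(self, theta, g):
        if len(self.window) < self.n:
            # Warm-up (assumption): wait until g_{t-n} exists before updating.
            self.window.append(g)
            return theta
        g_old = self.window.popleft()   # g_{t-n}
        self.window.append(g)           # window now holds g_{t-n+1}, ..., g_t
        self.v = self.beta2 * self.v + (1 - self.beta2) * self.phi(g_old * g_old)
        weights = [self.beta1 ** j for j in range(self.n)]
        newest_first = reversed(self.window)   # g_t, g_{t-1}, ..., g_{t-n+1}
        m = sum(w * x for w, x in zip(weights, newest_first)) / sum(weights)
        return theta - self.lr * m / (math.sqrt(self.v) + self.eps)


# Usage: the first n calls only fill the window; the third call updates theta.
opt = AdaShiftScalar(n=2, lr=0.1, beta1=0.5, beta2=0.75)
theta = 0.0
for g in [1.0, 2.0, 3.0]:
    theta = opt.step(theta, g)
print(theta)
```

The design point to notice is that `g_old` leaves the window before it enters `v`, so the gradients that form `m` never overlap with the gradient that updates `v` at the same step.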

AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

AdaX: Adaptive Gradient Descent with Exponential Long Term Memory

Adaptive Learning Rates with Maximum Variation Averaging

A New Adaptive Gradient Method with Gradient Decomposition

AdaDB: an Adaptive Gradient Method with Data-Dependent Bound

Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates

ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate

An Adaptive and Momental Bound Method for Stochastic Learning

Nostalgic Adam: Weighting More of the Past Gradients when Designing the Adaptive Learning Rate

A Comprehensive Framework for Analyzing the Convergence of Adam: Bridging the Gap with SGD

A modification of adaptive moment estimation (adam) for machine learning

Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analyses

AdaXod: a new adaptive and momental bound algorithm for training deep neural networks

On the Convergence of Decentralized Adaptive Gradient Methods

High Probability Convergence of Adam Under Unbounded Gradients and Affine Variance Noise

On Convergence of Adam for Stochastic Optimization under Relaxed Assumptions

MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients

Adam$^+$: A Stochastic Method with Adaptive Variance Reduction

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Adaptive Gradient Methods Can Be Provably Faster Than SGD after Finite Epochs