Abstract: Adam has been shown to fail to converge to the optimal solution in certain cases. Researchers have recently proposed several algorithms to avoid the non-convergence of Adam, but their efficiency turns out to be unsatisfactory in practice. In this paper, we provide new insight into the non-convergence issue of Adam as well as other adaptive learning rate methods. We argue that there exists an inappropriate correlation between the gradient $g_t$ and the second-moment term $v_t$ in Adam ($t$ is the timestep), so that a large gradient is likely to have a small step size while a small gradient may have a large step size. We demonstrate that such biased step sizes are the fundamental cause of the non-convergence of Adam, and we further prove that decorrelating $v_t$ and $g_t$ leads to an unbiased step size for each gradient, thus solving the non-convergence problem of Adam. Finally, we propose AdaShift, a novel adaptive learning rate method that decorrelates $v_t$ and $g_t$ by temporal shifting, i.e., using the temporally shifted gradient $g_{t-n}$ to calculate $v_t$. Experimental results demonstrate that AdaShift is able to address the non-convergence issue of Adam while still maintaining performance competitive with Adam in terms of both training speed and generalization.
What problem does this paper attempt to address?
This paper addresses the failure of the Adam optimization algorithm to converge to the optimal solution in some cases. Specifically, Adam exhibits an inappropriate positive correlation between the gradient \(g_t\) and the second-moment term \(v_t\), so large gradients tend to receive smaller step sizes while small gradients tend to receive larger ones. The paper argues that this biased step size is the fundamental cause of Adam's non-convergence.
### Main contributions of the paper
1. **Problem analysis**:
- By analyzing the cumulative (net) step size, the authors find that common adaptive learning rate methods scale steps in a way that is biased with respect to the magnitude of the gradient \(g_t\): large gradients tend to receive smaller step sizes, while small gradients receive larger ones.
- This bias stems from the inappropriate positive correlation between \(v_t\) and \(g_t\): when the gradient is large, \(v_t\) is also large, which shrinks the step size; when the gradient is small, \(v_t\) is small and the step size grows.
2. **Solution**:
- The authors propose a new adaptive learning rate method called AdaShift, which breaks the correlation between \(v_t\) and \(g_t\) through temporal shifting: \(v_t\) is computed from the earlier gradient \(g_{t-n}\) rather than from the current gradient \(g_t\).
- According to the paper, AdaShift resolves Adam's non-convergence problem while remaining competitive with Adam in training speed and generalization performance.
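The correlation argument above can be checked on a toy example. The sketch below is an illustration under stated assumptions, not the paper's experiment: it draws i.i.d. scalar gradients that are occasionally large (`big`) and otherwise `-1`, then compares the average scaling factor \(1/\sqrt{v_t}\) seen by large and small gradients when \(v_t\) is built from \(g_t\) (Adam-style) versus from \(g_{t-1}\) (a shift of one step). The function name `mean_scaling` and all parameter values are hypothetical.

```python
import math
import random


def mean_scaling(shift, steps=50000, beta2=0.9, big=10.0, p_big=0.1, seed=0):
    """Average 1/sqrt(v_t) seen by large and by small gradients.

    shift=0 feeds g_t into v_t (Adam-style); shift=1 feeds g_{t-1} instead.
    All numbers are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    v = 1.0        # arbitrary positive start; the burn-in below discards its effect
    prev_g = -1.0  # placeholder for the gradient before the first step
    sums = {"large": 0.0, "small": 0.0}
    counts = {"large": 0, "small": 0}
    for t in range(steps):
        # i.i.d. gradient stream: occasionally large, otherwise -1
        g = big if rng.random() < p_big else -1.0
        source = g if shift == 0 else prev_g
        v = beta2 * v + (1.0 - beta2) * source * source
        if t >= 100:  # skip a short burn-in
            key = "large" if g == big else "small"
            sums[key] += 1.0 / math.sqrt(v)
            counts[key] += 1
        prev_g = g
    return sums["large"] / counts["large"], sums["small"] / counts["small"]


adam_large, adam_small = mean_scaling(shift=0)
shift_large, shift_small = mean_scaling(shift=1)
print(f"v from g_t:     large={adam_large:.3f}  small={adam_small:.3f}")
print(f"v from g_(t-1): large={shift_large:.3f}  small={shift_small:.3f}")
```

With \(v_t\) built from \(g_t\), large gradients should see a visibly smaller average scaling than small ones. With the shifted version the two averages should be close, because \(v_t\) no longer depends on \(g_t\) when the gradients are independent.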
### Update rules
- The update rules in the Adam algorithm are:
\[
m_t=\beta_1m_{t - 1}+(1-\beta_1)g_t
\]
\[
v_t=\beta_2v_{t - 1}+(1-\beta_2)g_t^2
\]
\[
\theta_{t + 1}=\theta_t-\frac{\alpha_t}{\sqrt{v_t}}m_t
\]
- The update rules in the AdaShift algorithm are:
\[
v_t=\beta_2v_{t - 1}+(1-\beta_2)\varphi(g_{t - n}^2)
\]
\[
m_t=\sum_{i = 0}^{n - 1}\beta_1^ig_{t - i}/\sum_{i = 0}^{n - 1}\beta_1^i
\]
\[
\theta_{t + 1}[i]=\theta_t[i]-\frac{\alpha_t}{\sqrt{v_t[i]}}m_t[i]
\]
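where \(\varphi\) is a spatial operation that aggregates the squared gradient across the dimensions of a high-dimensional gradient, and the bracketed index \([i]\) in the parameter update selects a component (or block) of the parameters; it is unrelated to the summation index \(i\) in the formula for \(m_t\).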
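The two sets of update rules above can be sketched in plain Python for a single scalar parameter. This is a minimal illustration under assumptions, not the authors' implementation: the `eps` term in the denominator, the warm-up that skips updates until \(n + 1\) gradients are available, the default hyperparameters, and the names `adam_step` and `AdaShiftScalar` are all choices made for this sketch. For a scalar parameter, \(\varphi\) defaults to the identity.

```python
import math
from collections import deque


def adam_step(theta, g, m, v, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step as written above (no bias correction); eps is an added safeguard."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    theta = theta - lr * m / (math.sqrt(v) + eps)
    return theta, m, v


class AdaShiftScalar:
    """Scalar sketch of the AdaShift rules above (hypothetical helper class)."""

    def __init__(self, n=10, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, phi=None):
        self.n = n
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.phi = phi if phi is not None else (lambda x: x)  # identity for scalars
        self.grads = deque(maxlen=n + 1)  # holds g_{t-n}, ..., g_t (oldest first)
        self.v = 0.0

    def step(self, theta, g):
        self.grads.append(g)
        if len(self.grads) <= self.n:
            return theta  # assumption: no update until g_{t-n} exists
        shifted = self.grads[0]  # g_{t-n}, used only for v_t
        self.v = self.beta2 * self.v + (1.0 - self.beta2) * self.phi(shifted * shifted)
        recent = list(self.grads)[1:]  # g_{t-n+1}, ..., g_t
        weights = [self.beta1 ** i for i in range(self.n)]  # weight beta1^i for g_{t-i}
        m = sum(w * gi for w, gi in zip(weights, reversed(recent))) / sum(weights)
        return theta - self.lr * m / (math.sqrt(self.v) + self.eps)


# Usage: feed a short gradient stream through both optimizers.
theta_a, m, v = 1.0, 0.0, 0.0
opt = AdaShiftScalar(n=2, lr=0.1, beta1=0.5, beta2=0.5)
theta_s = 1.0
for g in [1.0, 2.0, 4.0]:
    theta_a, m, v = adam_step(theta_a, g, m, v, lr=0.1, beta1=0.5, beta2=0.5)
    theta_s = opt.step(theta_s, g)
```

Note that the gradient used for \(v_t\) (`shifted`) never appears in the window used for \(m_t\) (`recent`), which is the decorrelation the temporal shift is meant to provide.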
### Summary
This paper analyzes the non-convergence problem of Adam and proposes AdaShift as a remedy. AdaShift breaks the correlation between the gradient and the second-moment term through temporal shifting, which addresses Adam's non-convergence; the reported experiments indicate that it remains competitive with Adam in both training speed and generalization.