Momentum is All You Need for Data-Driven Adaptive Optimization

Yizhou Wang, Yue Kang, Can Qin, Huan Wang, Yi Xu, Yulun Zhang, Yun Fu
DOI: https://doi.org/10.1109/icdm58522.2023.00179
2023-01-01
Abstract: Adaptive gradient methods, e.g., ADAM, have achieved tremendous success in data-driven machine learning, especially deep learning. By adapting learning rates to the gradients, such methods train modern deep neural networks rapidly. Nevertheless, they are observed to generalize worse than stochastic gradient descent (SGD) and tend to become trapped in local minima early in training. Intriguingly, we discover that the issue can be resolved by replacing the gradient in ADAM's second raw moment estimate with its exponential moving average. The intuition is that the gradient with momentum contains more accurate directional information, so its second-moment estimate is a better basis for learning rate scaling than that of the raw gradient. We therefore propose ADAM$^{3}$ as a new optimizer that trains quickly while generalizing much better. Extensive experiments on a variety of tasks and models demonstrate that ADAM$^{3}$ consistently exhibits state-of-the-art performance and superior training stability. Considering the simplicity and effectiveness of ADAM$^{3}$, we believe it has the potential to become a new standard method in deep learning. Code is provided at https://github.com/wyzjack/AdaM3.
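The substitution described in the abstract can be illustrated with a minimal scalar sketch. It is based only on the abstract's one-sentence description, not on the released code: the function name `adam_like_step`, the flag `momentum_in_second_moment`, the default hyperparameters, the bias correction, and the placement of `eps` are all assumptions borrowed from the usual ADAM formulation, and the published algorithm may differ in these details.

```python
import math


def adam_like_step(param, grad, m, v, t,
                   lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                   momentum_in_second_moment=True):
    """One scalar update step (hypothetical helper, not the authors' API).

    momentum_in_second_moment=False: standard ADAM, v tracks grad**2.
    momentum_in_second_moment=True:  v tracks m**2, i.e. the raw gradient
    in the second-moment estimate is replaced by its moving average.
    """
    # First moment: exponential moving average of the gradient (momentum).
    m = beta1 * m + (1.0 - beta1) * grad

    # Second moment: the only line that differs between the two variants.
    source = m if momentum_in_second_moment else grad
    v = beta2 * v + (1.0 - beta2) * source * source

    # ADAM-style bias correction (assumed to carry over unchanged).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


if __name__ == "__main__":
    # Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
    for flag in (False, True):
        x, m, v = 0.0, 0.0, 0.0
        for t in range(1, 6):
            x, m, v = adam_like_step(x, 2.0 * (x - 3.0), m, v, t,
                                     momentum_in_second_moment=flag)
        print("momentum_in_second_moment =", flag, "-> x after 5 steps:", x)
```

With zero-initialized state and this naive bias correction, the first step of the modified variant is about 1/(1 - beta1) times larger than ADAM's first step, which is one reason to consult the authors' repository for the exact update rule rather than rely on this sketch.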