Abstract: It is not yet clear why ADAM-like adaptive gradient algorithms generalize worse than SGD despite training faster. This work aims to explain this generalization gap by analyzing the local convergence behavior of these algorithms. Specifically, we observe that the gradient noise in these algorithms is heavy-tailed. This motivates us to analyze them through their Lévy-driven stochastic differential equations (SDEs), since an algorithm and its SDE have similar convergence behavior. We then establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends positively on the Radon measure of the basin and negatively on the heaviness of the gradient noise; (2) for the same basin, SGD has a smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM, which adaptively scales each gradient coordinate, effectively diminishes the anisotropic structure in the gradient noise and results in a larger Radon measure of the basin; (b) the exponential gradient average in ADAM smooths its gradient and leads to lighter gradient noise tails than in SGD. SGD is therefore more locally unstable than ADAM at sharp minima, defined as minima whose local basins have small Radon measure, and can better escape from them to flatter ones with larger Radon measure. Since flat minima, which here refer to minima in flat or asymmetric basins/valleys, often generalize better than sharp ones~\cite{keskar2016large,he2019asymmetric}, our result explains the better generalization performance of SGD over ADAM. Finally, experimental results support our heavy-tailed gradient noise assumption and our theoretical findings.
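
The claim that escaping time decreases as the gradient noise becomes heavier-tailed can be illustrated with a small simulation. The sketch below is not the paper's experiment: it assumes a one-dimensional quadratic basin, symmetric alpha-stable noise drawn with the Chambers-Mallows-Stuck method, and an Euler discretization with step size `eta`. The names `sample_sas` and `escape_steps` and all parameter values are hypothetical choices made for illustration only.

```python
import numpy as np


def sample_sas(alpha, size, rng):
    """Symmetric alpha-stable samples (Chambers-Mallows-Stuck, alpha != 1).

    alpha = 2 gives Gaussian noise; smaller alpha gives heavier tails.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))


def escape_steps(alpha, n_trials=200, max_steps=5000,
                 eta=0.01, eps=0.2, radius=1.0, seed=0):
    """First step at which |x| >= radius, per trial (capped at max_steps).

    Toy dynamics (an assumption, not the paper's setup):
        x <- x - eta * x + eps * eta**(1/alpha) * S,
    i.e. gradient descent on f(x) = x^2 / 2 with alpha-stable noise S.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n_trials)
    steps = np.full(n_trials, max_steps)   # trials that never escape stay capped
    alive = np.ones(n_trials, dtype=bool)  # trials still inside the basin
    for t in range(1, max_steps + 1):
        noise = sample_sas(alpha, n_trials, rng)
        x = x - eta * x + eps * eta ** (1.0 / alpha) * noise
        escaped = alive & (np.abs(x) >= radius)
        steps[escaped] = t
        alive &= ~escaped
        if not alive.any():
            break
    return steps


if __name__ == "__main__":
    for alpha in (1.2, 2.0):
        steps = escape_steps(alpha)
        print(f"alpha={alpha}: median escape step = {np.median(steps)}")
```

In this toy setting, the heavy-tailed run (`alpha=1.2`) typically leaves the basin through a single large jump, while the Gaussian run (`alpha=2.0`) has to climb the basin wall and usually hits the step cap. This is consistent with the qualitative dependence on tail heaviness stated in the abstract, but it says nothing about the Radon-measure dependence.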

Understanding the Generalization Benefits of Late Learning Rate Decay

Stochastic Gradient Descent Introduces an Effective Landscape-Dependent Regularization Favoring Flat Solutions

How Does Learning Rate Decay Help Modern Neural Networks?

Advancing neural network calibration: The role of gradient decay in large-margin Softmax optimization

Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes

Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning

Understanding Why Neural Networks Generalize Well Through GSNR of Parameters

Large Learning Rates Improve Generalization: But How Large Are We Talking About?

Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning

Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities

Learning Stages: Phenomenon, Root Cause, Mechanism Hypothesis, and Implications

Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate

Generalization by design: Shortcuts to Generalization in Deep Learning

Why Does Sharpness-Aware Minimization Generalize Better Than SGD?

A new perspective for understanding generalization gap of deep neural networks trained with large batch sizes

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

When and how epochwise double descent happens

Towards Understanding Generalization Via Decomposing Excess Risk Dynamics