Abstract: It is not yet clear why ADAM-like adaptive gradient algorithms generalize worse than SGD despite training faster. This work aims to explain this generalization gap by analyzing the local convergence behavior of these algorithms. Specifically, we observe that the gradient noise in these algorithms is heavy-tailed. This motivates us to analyze them through their Lévy-driven stochastic differential equations (SDEs), since an algorithm and its SDE exhibit similar convergence behavior. We then establish the escaping time of these SDEs from a local basin. The result shows that (1) the escaping time of both SGD and ADAM depends positively on the Radon measure of the basin and negatively on the heaviness of the gradient noise; (2) for the same basin, SGD has a smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM, which adaptively scales each gradient coordinate, largely diminishes the anisotropic structure of the gradient noise and results in a larger Radon measure of the basin; and (b) the exponential gradient average in ADAM smooths its gradient and leads to lighter gradient noise tails than in SGD. Hence SGD is more locally unstable than ADAM at sharp minima, defined as minima whose local basins have small Radon measure, and can better escape from them to flatter ones with larger Radon measure. Since flat minima, which here refer to minima in flat or asymmetric basins/valleys, often generalize better than sharp ones~\cite{keskar2016large,he2019asymmetric}, our result explains the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient noise assumption and our theoretical claims.
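To make the claimed dependence of escaping time on tail heaviness concrete, the following is a minimal one-dimensional sketch, not the paper's experimental setup. It assumes a quadratic basin F(θ) = θ²/2, symmetric α-stable noise drawn with the Chambers-Mallows-Stuck method, and an Euler-type discretization in which the noise is scaled by lr^(1/α). All function names and parameter values (`lr`, `eps`, `radius`, `trials`) are hypothetical choices made for illustration only.

```python
import math
import random


def sample_symmetric_alpha_stable(alpha, rng):
    """Draw one symmetric alpha-stable variate (unit scale) via Chambers-Mallows-Stuck."""
    v = rng.uniform(-math.pi / 2, math.pi / 2)
    w = rng.expovariate(1.0)
    if alpha == 1.0:
        return math.tan(v)  # Cauchy special case
    left = math.sin(alpha * v) / math.cos(v) ** (1.0 / alpha)
    right = (math.cos((1.0 - alpha) * v) / w) ** ((1.0 - alpha) / alpha)
    return left * right


def escape_steps(alpha, rng, lr=0.01, eps=0.3, radius=1.0, max_steps=200_000):
    """Count steps until the iterate leaves the basin [-radius, radius].

    Assumed toy dynamics: theta <- theta - lr * F'(theta) + eps * lr**(1/alpha) * S,
    with F(theta) = theta**2 / 2 and S symmetric alpha-stable.
    """
    theta = 0.0
    noise_scale = eps * lr ** (1.0 / alpha)
    for step in range(1, max_steps + 1):
        grad = theta  # gradient of the assumed quadratic basin
        noise = sample_symmetric_alpha_stable(alpha, rng)
        theta = theta - lr * grad + noise_scale * noise
        if abs(theta) > radius:
            return step
    return max_steps  # cap reached without escaping


def mean_escape_steps(alpha, trials=20, seed=0):
    """Average the escape step count over independent runs with a fixed seed."""
    rng = random.Random(seed)
    return sum(escape_steps(alpha, rng) for _ in range(trials)) / trials


# alpha = 2.0 is the Gaussian case; smaller alpha means heavier tails.
heavy = mean_escape_steps(alpha=1.2)
gaussian = mean_escape_steps(alpha=2.0)
print(f"alpha=1.2: {heavy:.0f} steps, alpha=2.0: {gaussian:.0f} steps")
```

In this toy setting the heavier-tailed run (smaller α) should leave the basin in fewer steps on average, which is the qualitative trend the abstract describes. Increasing `radius` is a crude stand-in for a wider basin; it does not model the Radon measure used in the analysis.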

The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent

Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent

The Regularization Effects of Anisotropic Noise in Stochastic Gradient Descent

Beyond the Edge of Stability via Two-step Gradient Updates

Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions

Characterizing Dynamical Stability of Stochastic Gradient Descent in Overparameterized Learning

A Precise Characterization of SGD Stability Using Loss Surface Geometry

How SGD Selects the Global Minima in Over-parameterized Learning: A Dynamical Stability Perspective

Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability

The alignment property of SGD noise and how it helps select flat minima: A stability analysis

The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects

Stochastic Gradient Descent Introduces an Effective Landscape-Dependent Regularization Favoring Flat Solutions

Exact Mean Square Linear Stability Analysis for SGD

Uniform-in-Time Wasserstein Stability Bounds for (Noisy) Stochastic Gradient Descent

Improved Stability and Generalization Guarantees of the Decentralized SGD Algorithm

Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult

Generalization Error Bounds for Optimization Algorithms Via Stability

Stability and Generalization of the Decentralized Stochastic Gradient Descent Ascent Algorithm

Stability-Based Generalization Analysis of the Asynchronous Decentralized SGD

Stability Based Generalization Bounds for Exponential Family Langevin Dynamics

Towards Theoretically Understanding Why SGD Generalizes Better Than Adam in Deep Learning