Abstract: In this work we explore the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD). As observed previously, long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion in which distance travelled grows as a power law in the number of gradient updates with a nontrivial exponent. We reveal an intricate interaction between the hyperparameters of optimization, the structure in the gradient noise, and the Hessian matrix at the end of training that explains this anomalous diffusion. To build this understanding, we first derive a continuous-time model for SGD with finite learning rates and batch sizes as an underdamped Langevin equation. We study this equation in the setting of linear regression, where we can derive exact, analytic expressions for the phase space dynamics of the parameters and their instantaneous velocities from initialization to stationarity. Using the Fokker-Planck equation, we show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space. We identify qualitative and quantitative predictions of this theory in the dynamics of a ResNet-18 model trained on ImageNet. Through the lens of statistical physics, we uncover a mechanistic origin for the anomalous limiting dynamics of deep neural networks trained with SGD.
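
For readers unfamiliar with the two technical terms in the abstract, the block below writes out their generic textbook forms. It is a sketch only: the symbols $m$, $\gamma$, $T$, $U$, $c$, $t_0$, and $\tau$ are placeholders introduced here, and the abstract does not state how the paper maps learning rate, batch size, or the loss onto them.

```latex
% Generic underdamped Langevin dynamics in phase space (position \theta, velocity v).
% U is a potential, \gamma a friction coefficient, T a noise temperature, W_t a Wiener process.
\begin{aligned}
  d\theta_t &= v_t \, dt, \\
  m \, dv_t &= -\gamma v_t \, dt - \nabla U(\theta_t) \, dt + \sqrt{2 \gamma T} \, dW_t .
\end{aligned}

% Diffusion exponent c, measured from a reference time t_0 after convergence.
% c = 1 is ordinary (Brownian) diffusion; any other value of c is called anomalous.
\big\langle \lVert \theta_{t_0 + \tau} - \theta_{t_0} \rVert^2 \big\rangle \;\propto\; \tau^{c}
```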

What problem does this paper attempt to address?

The paper studies the limiting dynamics of deep neural networks trained with Stochastic Gradient Descent (SGD), and in particular the anomalous diffusion observed late in training. It addresses the following points:

1. **Understanding the phenomenon of anomalous diffusion**: Long after the performance of a deep neural network has converged, the network parameters continue to move through parameter space. The distance travelled grows as a power law in the number of gradient updates, with a nontrivial exponent.
2. **Revealing the dynamic mechanism**: Through theoretical analysis and experiments, the paper aims to uncover what drives this anomalous diffusion. It attributes the behavior to an interaction between the optimization hyperparameters, the structure of the gradient noise, and the Hessian matrix at the end of training.
3. **Establishing a continuous-time model**: The paper derives a continuous-time model of SGD with finite learning rate and batch size, in the form of an underdamped Langevin equation. This makes the limiting dynamics amenable to analytic treatment.
4. **Exact analysis in linear regression**: In the setting of linear regression, the paper derives exact analytic expressions for the phase space dynamics of the parameters and their instantaneous velocities, from initialization to stationarity. Using the Fokker-Planck equation, it shows that these dynamics are driven not by the original training loss but by a modified loss, which implicitly regularizes the velocity, together with probability currents, which cause oscillations in phase space.
5. **Empirical validation**: The paper identifies qualitative and quantitative predictions of the theory in the dynamics of a ResNet-18 model trained on ImageNet.

In summary, the paper combines theoretical analysis with empirical study to explain the anomalous diffusion observed in deep neural networks trained with SGD and the mechanisms behind it.
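
The power-law measurement described in the summary can be illustrated with a small NumPy sketch. This is not the paper's code or experimental setup: the function names (`sgd_trajectory`, `squared_displacement`, `fit_power_law_exponent`), the use of heavy-ball momentum, the toy least-squares problem, the choice of squared displacement from a reference iterate, and all hyperparameter values are assumptions made for illustration.

```python
import numpy as np


def sgd_trajectory(X, y, lr=0.01, batch_size=8, momentum=0.9, steps=2000, seed=0):
    """Run minibatch SGD with heavy-ball momentum on least squares.

    Returns an array of shape (steps + 1, d) holding every iterate.
    All default hyperparameters are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)  # parameters
    v = np.zeros(d)  # velocity (momentum buffer)
    traj = np.empty((steps + 1, d))
    traj[0] = w
    for k in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        # Gradient of the mean squared error (with a 1/2 factor) on the minibatch.
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        v = momentum * v - lr * grad
        w = w + v
        traj[k + 1] = w
    return traj


def squared_displacement(traj, start):
    """Squared Euclidean distance of each later iterate from traj[start]."""
    ref = traj[start]
    return np.sum((traj[start + 1:] - ref) ** 2, axis=1)


def fit_power_law_exponent(disp):
    """Least-squares slope of log(disp) against log(step count)."""
    t = np.arange(1, len(disp) + 1)
    mask = disp > 0  # the logarithm needs strictly positive values
    slope, _ = np.polyfit(np.log(t[mask]), np.log(disp[mask]), 1)
    return slope


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 10))
    y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=256)
    traj = sgd_trajectory(X, y)
    # Measure displacement only after an (assumed) burn-in of 1000 steps.
    disp = squared_displacement(traj, start=1000)
    print("fitted exponent:", fit_power_law_exponent(disp))
```

A fitted slope of 1 would correspond to ordinary diffusion. A single noisy trajectory on a toy problem gives only a rough estimate, so averaging over seeds and reference points would be a natural refinement.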

The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion

Dynamic of Stochastic Gradient Descent with State-Dependent Noise

Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks

Machine learning in and out of equilibrium

The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects

Does SGD really happen in tiny subspaces?

High-dimensional scaling limits and fluctuations of online least-squares SGD with smooth covariance

Phylogenetic relationships within the lizard clade Xantusiidae: using trees and divergence times to address evolutionary questions at multiple levels.

Stochastic Gradient Descent Introduces an Effective Landscape-Dependent Regularization Favoring Flat Solutions

Revisiting the Noise Model of Stochastic Gradient Descent

Hitting the High-Dimensional Notes: An ODE for SGD learning dynamics on GLMs and multi-index models

Revisiting the Characteristics of Stochastic Gradient Noise and Dynamics

Singular-limit analysis of gradient descent with noise injection

Stochastic Gradient Descent and Anomaly of Variance-flatness Relation in Artificial Neural Networks

Three Factors Influencing Minima in SGD

Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks

The Regularization Effects of Anisotropic Noise in Stochastic Gradient Descent.

An Alternative View: When Does SGD Escape Local Minima?

Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions

A PDE-based Explanation of Extreme Numerical Sensitivities and Edge of Stability in Training Neural Networks