The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations, and Anomalous Diffusion

Daniel Kunin, Javier Sagastuy-Brena, Lauren Gillespie, Eshed Margalit, Hidenori Tanaka, Surya Ganguli, Daniel L. K. Yamins
DOI: https://doi.org/10.1162/neco_a_01626
2023-12-29
Abstract: In this work we explore the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD). As observed previously, long after performance has converged, networks continue to move through parameter space by a process of anomalous diffusion in which distance travelled grows as a power law in the number of gradient updates with a nontrivial exponent. We reveal an intricate interaction between the hyperparameters of optimization, the structure in the gradient noise, and the Hessian matrix at the end of training that explains this anomalous diffusion. To build this understanding, we first derive a continuous-time model for SGD with finite learning rates and batch sizes as an underdamped Langevin equation. We study this equation in the setting of linear regression, where we can derive exact, analytic expressions for the phase space dynamics of the parameters and their instantaneous velocities from initialization to stationarity. Using the Fokker-Planck equation, we show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space. We identify qualitative and quantitative predictions of this theory in the dynamics of a ResNet-18 model trained on ImageNet. Through the lens of statistical physics, we uncover a mechanistic origin for the anomalous limiting dynamics of deep neural networks trained with SGD.
Machine Learning, Statistical Mechanics, Neurons and Cognition
What problem does this paper attempt to address?
The paper studies the limiting dynamics of deep neural networks trained with Stochastic Gradient Descent (SGD), and in particular the anomalous diffusion they exhibit. Specifically, it addresses the following key issues:

1. **Understanding the phenomenon of anomalous diffusion**: Even after the performance of a deep neural network has converged, the network parameters continue to move through parameter space. This movement is anomalous diffusion: the distance travelled grows as a power law in the number of gradient updates, with a nontrivial exponent.
2. **Revealing the dynamic mechanism**: Through theoretical analysis and empirical research, the paper aims to uncover the mechanism driving this anomalous diffusion. In particular, it explores how the interaction between the optimization hyperparameters, the structure of the gradient noise, and the Hessian matrix at the end of training produces it.
3. **Establishing a continuous-time model**: To understand these limiting dynamics, the paper derives a continuous-time model of SGD with finite learning rate and batch size, in the form of an underdamped Langevin equation. This makes the dynamics amenable to analytical treatment.
4. **Combining theory and practice**: In the simple setting of linear regression, the paper derives exact analytic expressions for the phase space dynamics of the parameters and their instantaneous velocities, from initialization to stationarity. Using the Fokker-Planck equation, it shows that the limiting dynamics are driven not by the original training loss but by a modified loss, which implicitly regularizes the velocity, together with probability currents, which cause oscillations in phase space.
5. **Empirical validation**: Finally, the paper tests these theoretical predictions on pre-trained deep neural networks (such as ResNet-18), and reports qualitative and quantitative agreement between theory and observation on the ImageNet dataset.
In summary, the paper combines theoretical analysis and empirical study to explain the anomalous diffusion that persists after deep neural networks trained with SGD have converged, and the mechanisms behind it.
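The power-law measurement described above can be illustrated with a toy experiment. The following is a minimal sketch, not the paper's code: it runs minibatch SGD with heavy-ball momentum on a synthetic linear regression problem, records the parameter trajectory, and fits the exponent of the squared displacement on a log-log scale. The function names (`run_sgd`, `squared_displacement`, `fit_power_law_exponent`), the hyperparameter values, the synthetic data, and the use of momentum are all assumptions made for illustration, and the fitted exponent from such a small problem should not be read as a reproduction of the paper's results.

```python
import numpy as np


def run_sgd(X, y, lr=0.01, momentum=0.9, batch_size=8, steps=2000, seed=0):
    """Minibatch SGD with heavy-ball momentum on mean squared error.

    Returns the parameter trajectory with shape (steps + 1, d).
    All hyperparameter defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    velocity = np.zeros(d)
    trajectory = np.empty((steps + 1, d))
    trajectory[0] = theta
    for t in range(steps):
        # Sample a minibatch without replacement; this is the noise source.
        idx = rng.choice(n, size=batch_size, replace=False)
        residual = X[idx] @ theta - y[idx]
        grad = X[idx].T @ residual / batch_size
        velocity = momentum * velocity - lr * grad
        theta = theta + velocity
        trajectory[t + 1] = theta
    return trajectory


def squared_displacement(trajectory, start):
    """Squared distance from the parameters at step `start` onward."""
    reference = trajectory[start]
    return np.sum((trajectory[start + 1:] - reference) ** 2, axis=1)


def fit_power_law_exponent(steps, values):
    """Least-squares slope of log(values) against log(steps)."""
    steps = np.asarray(steps, dtype=float)
    values = np.asarray(values, dtype=float)
    mask = values > 0  # the logarithm is undefined at zero displacement
    slope, _ = np.polyfit(np.log(steps[mask]), np.log(values[mask]), 1)
    return slope


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(256, 10))
    y = X @ np.ones(10) + 0.5 * rng.normal(size=256)

    trajectory = run_sgd(X, y, steps=5000)
    # Measure displacement only after an assumed burn-in of 2000 steps.
    displacement = squared_displacement(trajectory, start=2000)
    lags = np.arange(1, displacement.size + 1)
    print("fitted exponent:", fit_power_law_exponent(lags, displacement))
```

An exponent of 1 for the squared displacement would correspond to ordinary diffusion, so a fitted value that differs from 1 is what "anomalous" refers to. In a low-dimensional convex problem like this one the displacement is expected to saturate once the iterates reach their stationary distribution, so the fit depends strongly on the window of lags chosen.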