Abstract: The algorithms used to train neural networks, like stochastic gradient descent (SGD), have close parallels to natural processes that navigate a high-dimensional parameter space -- for example protein folding or evolution. Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels in a single, unified framework. We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium, exhibiting persistent currents in the space of network parameters. As in its physical analogues, the current is associated with an entropy production rate for any given training trajectory. The stationary distribution of these rates obeys the integral and detailed fluctuation theorems -- nonequilibrium generalizations of the second law of thermodynamics. We validate these relations in two numerical examples, a nonlinear regression network and MNIST digit classification. While the fluctuation theorems are universal, other aspects of the stationary state are highly sensitive to the training details. Surprisingly, the effective loss landscape and diffusion matrix that determine the shape of the stationary distribution depend on the simple choice of whether minibatching is done with or without replacement. We can take advantage of this nonequilibrium sensitivity to engineer an equilibrium stationary state for a particular application: sampling from a posterior distribution of network weights in Bayesian machine learning. We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching. In an example system where the posterior is exactly known, this SGWORLD algorithm outperforms SGLD, converging to the posterior orders of magnitude faster as a function of the learning rate.
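
The distinction between minibatching with and without replacement can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the paper: the function names are hypothetical, and it assumes the common reading in which "without replacement" means shuffling the data once per epoch and cutting the permutation into consecutive batches, while "with replacement" means drawing every index independently.

```python
import random

def minibatches_with_replacement(n_samples, batch_size, n_batches, rng):
    """Draw every index independently and uniformly; repeats are possible."""
    return [[rng.randrange(n_samples) for _ in range(batch_size)]
            for _ in range(n_batches)]

def minibatches_without_replacement(n_samples, batch_size, rng):
    """Shuffle once per epoch, then slice the permutation into batches.

    Assumption: this epoch-based shuffling is one common convention and
    may differ in detail from the scheme analyzed in the paper.
    """
    order = list(range(n_samples))
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, n_samples, batch_size)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
with_rep = minibatches_with_replacement(12, 4, 3, rng)
without_rep = minibatches_without_replacement(12, 4, rng)
```

In the without-replacement case every sample is visited exactly once per epoch, so the gradient noise in successive batches is correlated; in the with-replacement case the batches are independent draws. This is the kind of training detail the abstract says the stationary state is sensitive to.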

Random Matrix Theory for Stochastic Gradient Descent

Stochastic weight matrix dynamics during learning and Dyson Brownian motion

Rigorous dynamical mean field theory for stochastic gradient descent methods

A Random Matrix Theory Approach to Damping in Deep Learning

Analysis of Stochastic Gradient Descent in Continuous Time

Dynamic of Stochastic Gradient Descent with State-Dependent Noise

Beyond the Edge of Stability via Two-step Gradient Updates

Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis

Random Function Descent

Online Stochastic Gradient Descent Learns Linear Dynamical Systems from A Single Trajectory

Machine learning in and out of equilibrium

Effect of Random Learning Rate: Theoretical Analysis of SGD Dynamics in Non-Convex Optimization via Stationary Distribution

Understanding the unstable convergence of gradient descent

Revisiting the Characteristics of Stochastic Gradient Noise and Dynamics

Type-II Saddles and Probabilistic Stability of Stochastic Gradient Descent

On Markov Chain Gradient Descent

Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification

From Stability to Chaos: Analyzing Gradient Descent Dynamics in Quadratic Regression

Large-Scale Machine Learning with Stochastic Gradient Descent

Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks