Abstract: The algorithms used to train neural networks, like stochastic gradient descent (SGD), have close parallels to natural processes that navigate a high-dimensional parameter space -- for example, protein folding or evolution. Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels in a single, unified framework. We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium, exhibiting persistent currents in the space of network parameters. As in its physical analogues, the current is associated with an entropy production rate for any given training trajectory. The stationary distribution of these rates obeys the integral and detailed fluctuation theorems -- nonequilibrium generalizations of the second law of thermodynamics. We validate these relations in two numerical examples, a nonlinear regression network and MNIST digit classification. While the fluctuation theorems are universal, other aspects of the stationary state are highly sensitive to the training details. Surprisingly, the effective loss landscape and diffusion matrix that determine the shape of the stationary distribution depend on the simple choice of whether minibatching is done with or without replacement. We can take advantage of this nonequilibrium sensitivity to engineer an equilibrium stationary state for a particular application: sampling from a posterior distribution of network weights in Bayesian machine learning. We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching. In an example system where the posterior is exactly known, this SGWORLD algorithm outperforms SGLD, converging to the posterior orders of magnitude faster as a function of the learning rate.
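The distinction between minibatching with and without replacement can be made concrete with a small sketch. The code below is not the SGWORLD algorithm, whose details are not given in the abstract; it only illustrates the two sampling schemes and plugs them into the standard SGLD update on a toy problem. The function names (`epoch_batches`, `sgld_epoch`), the one-dimensional Gaussian-mean model, the flat prior, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def epoch_batches(n_data, batch_size, replace, rng):
    """Return the list of minibatch index lists for one epoch (hypothetical helper)."""
    n_batches = n_data // batch_size
    if replace:
        # With replacement: each index is an independent draw, so a data
        # point may appear several times (or not at all) within an epoch.
        return [[rng.randrange(n_data) for _ in range(batch_size)]
                for _ in range(n_batches)]
    # Without replacement: shuffle once, then cut into disjoint batches, so
    # each data point is visited exactly once per epoch.
    order = list(range(n_data))
    rng.shuffle(order)
    return [order[k * batch_size:(k + 1) * batch_size]
            for k in range(n_batches)]


def sgld_epoch(theta, data, batch_size, step, replace, rng):
    """One epoch of standard SGLD on a toy 1D posterior.

    Toy model (an assumption, not from the paper): x_i ~ N(theta, 1) with a
    flat prior, so the per-point log-likelihood gradient is (x_i - theta).
    """
    n_data = len(data)
    for batch in epoch_batches(n_data, batch_size, replace, rng):
        # Minibatch estimate of the full-data gradient of the log-posterior.
        grad = (n_data / len(batch)) * sum(data[i] - theta for i in batch)
        # Langevin step: half-step drift plus Gaussian noise of variance `step`.
        theta = theta + 0.5 * step * grad + rng.gauss(0.0, math.sqrt(step))
    return theta


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(1.0, 1.0) for _ in range(100)]
    for replace in (True, False):
        theta = 0.0
        for _ in range(200):
            theta = sgld_epoch(theta, data, 10, 1e-3, replace, rng)
        print("replace =", replace, "final theta sample:", theta)
```

In this toy setting the exact posterior is Gaussian, centered on the sample mean with variance 1/N, so samples collected after burn-in under either scheme could be compared against it; the sketch makes no claim about which scheme performs better.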

Stochasticity helps to navigate rough landscapes: comparing gradient-descent-based algorithms in the phase retrieval problem

Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks

Lévy Gradient Descent: Augmented Random Search for Geophysical Inverse Problems

Stochastic Gradient Descent outperforms Gradient Descent in recovering a high-dimensional signal in a glassy energy landscape

Langevin Dynamics: A Unified Perspective on Optimization via Lyapunov Potentials

Stochastic Gradient and Langevin Processes

Convergence of Constant Step Stochastic Gradient Descent for Non-Smooth Non-Convex Functions

Stochastic Gradient Descent Introduces an Effective Landscape-Dependent Regularization Favoring Flat Solutions

From Stability to Chaos: Analyzing Gradient Descent Dynamics in Quadratic Regression

Beyond the Edge of Stability via Two-step Gradient Updates

Generalization Bounds for Gradient Methods via Discrete and Continuous Prior

The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects

From Zero to Hero: How local curvature at artless initial conditions leads away from bad minima

Gradient is All You Need?

Adaptive Non-reversible Stochastic Gradient Langevin Dynamics

Langevin algorithms for Markovian Neural Networks and Deep Stochastic control

Machine learning in and out of equilibrium

Exact Langevin Dynamics with Stochastic Gradients

Global convergence of gradient descent for phase retrieval

Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification

Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion, and Blind Deconvolution