Machine learning in and out of equilibrium

Shishir Adhikari, Alkan Kabakçıoğlu, Alexander Strang, Deniz Yuret, Michael Hinczewski
2023-06-06
Abstract: The algorithms used to train neural networks, like stochastic gradient descent (SGD), have close parallels to natural processes that navigate a high-dimensional parameter space -- for example protein folding or evolution. Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels in a single, unified framework. We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium, exhibiting persistent currents in the space of network parameters. As in its physical analogues, the current is associated with an entropy production rate for any given training trajectory. The stationary distribution of these rates obeys the integral and detailed fluctuation theorems -- nonequilibrium generalizations of the second law of thermodynamics. We validate these relations in two numerical examples, a nonlinear regression network and MNIST digit classification. While the fluctuation theorems are universal, there are other aspects of the stationary state that are highly sensitive to the training details. Surprisingly, the effective loss landscape and diffusion matrix that determine the shape of the stationary distribution vary depending on the simple choice of minibatching done with or without replacement. We can take advantage of this nonequilibrium sensitivity to engineer an equilibrium stationary state for a particular application: sampling from a posterior distribution of network weights in Bayesian machine learning. We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without replacement minibatching. In an example system where the posterior is exactly known, this SGWORLD algorithm outperforms SGLD, converging to the posterior orders of magnitude faster as a function of the learning rate.
Machine Learning, Statistical Mechanics
What problem does this paper attempt to address?
The paper explores the parallels between the algorithms used to train neural networks and nonequilibrium processes in nature, and studies them within a single framework based on the Fokker-Planck equation. Specifically, it addresses the following questions:

1. **Does stochastic gradient descent (SGD), the main algorithm for training neural networks, reach a stationary state after long training?** If so, is that state in thermodynamic equilibrium or out of equilibrium?
2. **What are the properties of the SGD stationary state?** In particular, the paper examines the entropy production rate and the persistent probability currents, by analogy with physical systems.
3. **How does the minibatching strategy (with or without replacement) affect SGD training?** The focus is on how this choice changes the stationary distribution and the diffusion matrix.
4. **Can the nonequilibrium character of SGD be used to design more efficient Bayesian machine learning algorithms?** For example, by adjusting the training procedure so that it samples the posterior distribution of network weights more accurately.

The paper uses the Fokker-Planck equation to describe how the probability density over network parameters evolves during SGD training, and uses this description to analyze the nonequilibrium features of the process. It also proposes a new algorithm, SGWORLD, which uses without-replacement minibatching to improve the efficiency of sampling network weights in Bayesian machine learning.
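The distinction between the two minibatching strategies can be made concrete with a small sketch. The Python code below is an illustration under stated assumptions, not the paper's SGWORLD algorithm: it runs a plain SGLD update on a toy one-dimensional problem (estimating the mean of unit-variance Gaussian data under a flat prior, so the exact posterior is Gaussian with mean `x.mean()` and variance `1/n`). The function names `minibatches` and `sgld_samples` are hypothetical. "With replacement" is taken here to mean that every batch is drawn independently from the full dataset, and "without replacement" to mean that each epoch is a random partition of the dataset into batches.

```python
import numpy as np


def minibatches(n, batch_size, n_steps, replace, rng):
    """Yield n_steps index arrays, each of length batch_size.

    replace=True:  every batch is drawn independently from range(n).
    replace=False: each epoch is a fresh random permutation of range(n),
                   cut into consecutive batches (assumes batch_size divides n).
    """
    if replace:
        for _ in range(n_steps):
            yield rng.integers(0, n, size=batch_size)
        return
    if n % batch_size != 0:
        raise ValueError("batch_size must divide n for without-replacement epochs")
    step = 0
    while step < n_steps:
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            if step == n_steps:
                return
            yield perm[start:start + batch_size]
            step += 1


def sgld_samples(x, eta, batch_size, n_steps, replace, seed=0):
    """Run SGLD on the toy loss L(theta) = sum_i 0.5 * (theta - x_i)**2.

    The target density is proportional to exp(-L), i.e. a Gaussian with
    mean x.mean() and variance 1/len(x). Returns the chain of theta values.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    theta = 0.0
    chain = np.empty(n_steps)
    for t, idx in enumerate(minibatches(n, batch_size, n_steps, replace, rng)):
        # Minibatch estimate of dL/dtheta, rescaled to the full dataset size.
        grad = (n / batch_size) * np.sum(theta - x[idx])
        # Langevin step: gradient descent plus Gaussian noise of variance 2*eta.
        theta = theta - eta * grad + np.sqrt(2.0 * eta) * rng.standard_normal()
        chain[t] = theta
    return chain


if __name__ == "__main__":
    data = np.random.default_rng(42).normal(1.0, 1.0, size=100)
    print("exact posterior variance:", 1.0 / len(data))
    for replace in (True, False):
        chain = sgld_samples(data, eta=1e-3, batch_size=10,
                             n_steps=20000, replace=replace)[2000:]
        label = "with replacement" if replace else "without replacement"
        print(label, "mean:", chain.mean(), "variance:", chain.var())
```

Comparing the sampled variance against the exact value `1/n` for each strategy, at several learning rates `eta`, gives a rough feel for how minibatch noise distorts the stationary distribution at finite learning rate. The toy model is far simpler than the networks studied in the paper, so it should not be read as reproducing the reported results.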