Abstract: Stochastic gradient descent (SGD) is a popular algorithm for minimizing objective functions that arise in machine learning. For constant step-sized SGD, the iterates form a Markov chain on a general state space. Focusing on a class of separable (non-convex) objective functions, we establish a "Doeblin-type decomposition," in that the state space decomposes into a uniformly transient set and a disjoint union of absorbing sets. Each of the absorbing sets contains a unique invariant measure, with the set of all invariant measures being the convex hull. Moreover the set of invariant measures are shown to be global attractors to the Markov chain with a geometric convergence rate. The theory is highlighted with examples that show: (1) the failure of the diffusion approximation to characterize the long-time dynamics of SGD; (2) the global minimum of an objective function may lie outside the support of the invariant measures (i.e., even if initialized at the global minimum, SGD iterates will leave); and (3) bifurcations may enable the SGD iterates to transition between two local minima. Key ingredients in the theory involve viewing the SGD dynamics as a monotone iterated function system and establishing a "splitting condition" of Dubins and Freedman 1966 and Bhattacharya and Lee 1988.

What problem does this paper attempt to address?

The paper addresses the convergence of constant step-size stochastic gradient descent (SGD) on separable non-convex objective functions. Specifically, it studies the Markov chain formed by the SGD iterates when the step size is held constant. The authors focus on the following points:

1. **State Space Decomposition**: The paper establishes a "Doeblin-type decomposition": the state space is the disjoint union of a uniformly transient set and several absorbing sets. Each absorbing set contains a unique invariant measure, and the set of all invariant measures is the convex hull of these.
2. **Global Attractiveness of Invariant Measures**: The paper proves that the invariant measures are global attractors of the Markov chain, with a geometric convergence rate.
3. **Limitations of Diffusion Approximation**: The paper shows through examples that the diffusion approximation fails to characterize the long-time dynamics of SGD.
4. **Global Minimum Outside the Support**: The global minimum of the objective may lie outside the support of the invariant measures, so that SGD iterates leave it even when initialized there.
5. **Transitions between Local Minima**: The paper shows that bifurcations may enable the SGD iterates to transition between two local minima.

### Formulas and Concepts

- **Objective Function**: \[ F(x)=\frac{1}{n} \sum_{i = 1}^{n} f_i(x) \] where the \(f_i:\mathbb{R}^d\rightarrow\mathbb{R}\) are separable (non-convex) functions.
- **SGD Update Rule**: \[ \phi_i(x):=x-\eta\nabla f_i(x)\quad(1\leq i\leq n) \] \[ X_{k + 1}=\phi_{i_k}(X_k)\quad\text{where}\quad i_k\in\{1,2,\ldots,n\} \] Here \(\eta\) is the constant step size, and the index \(i_k\) is drawn uniformly at random, as the factor \(1/n\) in the transition kernel reflects.
- **Transition Kernel**: \[ p(x,A)=\frac{1}{n} \sum_{i = 1}^{n}\chi_A(\phi_i(x)) \] where \(\chi_A\) is the characteristic function of the set \(A\).
- **Markov Operator**: \[ (P\mu)(A)=\int_{\mathbb{R}^d}p(x,A)\,d\mu(x) \]
- **Invariant Measure**: \[ P\mu^{\star}=\mu^{\star} \]

### Main Results

- **State Space Decomposition**: \[ I = B\cup\bigcup_{m\in M}T_m \] where \(B\) is a uniformly transient set, the \(T_m\) are absorbing sets, and each \(T_m\) contains at least one local minimum.
- **Uniqueness and Geometric Convergence of Invariant Measures**: \[ d_{\alpha_m}(P^k\mu,\mu^{\star}_m)\leq\left(1-\frac{1}{n\ell_m}\right)^{\lfloor k / \ell_m\rfloor}\quad\text{for any}\ \mu\ \text{supported on}\ T_m \]
- **Global Convergence**: \[ \tilde{d}(\mu_k,\mu^{\star})\leq3\left(1-\frac{1}{n\ell}\right)^{\lfloor k / \ell\rfloor} \] where \(\mu^{\star}\) is an invariant measure of the form \(\sum_{m\in M}c_m\mu^{\star}_m\) with \(\sum_{m\in M}c_m = 1\).

### Important Conclusions

- **Doeblin-type Decomposition**: The state space can be decomposed into a uniformly transient set and absorbing sets.
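To make the update rule, transition kernel, and rate bound above concrete, here is a minimal Python sketch on a one-dimensional toy problem. The component functions \(f_i(x) = (x^2-1)^2/4 + a_i x\) with \(a = (0.3, -0.3)\), the step size, and all function names are assumptions chosen for illustration; they are not taken from the paper. For this toy choice both maps \(\phi_i\) are increasing on \([-1.5, 1.5]\), and iterates started near either well appear to stay there, which loosely mirrors the idea of separate absorbing sets.

```python
import random

ETA = 0.1              # constant step size (illustrative value, an assumption)
SHIFTS = (0.3, -0.3)   # hypothetical linear tilts a_i defining f_1 and f_2


def grad_f(i, x):
    """Gradient of the toy component f_i(x) = (x**2 - 1)**2 / 4 + SHIFTS[i] * x."""
    return x**3 - x + SHIFTS[i]


def phi(i, x, eta=ETA):
    """One SGD step using component i: phi_i(x) = x - eta * grad f_i(x)."""
    return x - eta * grad_f(i, x)


def transition_kernel(x, lo, hi, eta=ETA):
    """p(x, A) for the interval A = [lo, hi]: fraction of maps phi_i sending x into A."""
    n = len(SHIFTS)
    return sum(lo <= phi(i, x, eta) <= hi for i in range(n)) / n


def run_sgd(x0, steps, seed=0, eta=ETA):
    """Simulate X_{k+1} = phi_{i_k}(X_k) with i_k drawn uniformly from the indices."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(steps):
        i_k = rng.randrange(len(SHIFTS))
        path.append(phi(i_k, path[-1], eta))
    return path


def rate_bound(k, n, ell):
    """Evaluate (1 - 1/(n*ell)) ** floor(k/ell), the form of the bound quoted above."""
    return (1.0 - 1.0 / (n * ell)) ** (k // ell)


if __name__ == "__main__":
    for x0 in (-1.0, 1.0):
        tail = run_sgd(x0, steps=2000)[1000:]
        print(f"x0={x0:+.1f}  tail range=[{min(tail):.3f}, {max(tail):.3f}]")
    # ell=3 is a placeholder, not a value derived for this toy problem.
    print("bound at k=60, n=2, ell=3:", rate_bound(60, n=2, ell=3))
```

Running the script prints the range visited by the second half of each trajectory, which gives a rough picture of the support of the limiting distribution for each starting point. The value of `ell` passed to `rate_bound` is a placeholder: the sketch does not compute the constant \(\ell\) from the theory.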

Convergence of Markov Chains for Constant Step-size Stochastic Gradient Descent with Separable Functions
