Abstract: We consider the optimization of a smooth and strongly convex objective using constant step-size stochastic gradient descent (SGD) and study its properties through the prism of Markov chains. We show that, for unbiased gradient estimates with mildly controlled variance, the iteration converges to an invariant distribution in total variation distance. We also establish this convergence in Wasserstein-2 distance in a more general setting compared to previous work. Thanks to the invariance property of the limit distribution, our analysis shows that the latter inherits sub-Gaussian or sub-exponential concentration properties when these hold true for the gradient. This allows the derivation of high-confidence bounds for the final estimate. Finally, under such conditions in the linear case, we obtain a dimension-free deviation bound for the Polyak-Ruppert average of a tail sequence. All our results are non-asymptotic and their consequences are discussed through a few applications.

What problem does this paper attempt to address?

The paper studies how to establish convergence and concentration properties of constant step-size stochastic gradient descent (SGD) using Markov chain theory. Specifically, the authors consider the optimization of smooth and strongly convex objective functions and analyze the constant step-size SGD iterates, and the estimators built from them, as a Markov chain. The main contributions of the paper include:

1. **Convergence Results**:
   - The Markov chain of SGD iterates converges to an invariant distribution in total variation distance and in Wasserstein-2 distance. The Wasserstein-2 result holds under more general conditions than in previous work.
2. **Concentration Properties**:
   - Sub-Gaussian or sub-exponential concentration properties of the gradient samples transfer to the invariant distribution, which yields high-confidence deviation bounds for the final SGD iterate.
   - In the linear case, under stronger concentration assumptions, a dimension-free high-confidence deviation bound is derived.
3. **Polyak-Ruppert Averaging**:
   - For the Polyak-Ruppert average of a tail sequence, a dimension-free high-confidence deviation bound is obtained. It follows from a more general concentration result that applies to any Lipschitz function of a stationary sequence.
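The behavior summarized above can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not taken from the paper: a one-dimensional quadratic objective \(L(\theta)=\tfrac{\mu}{2}(\theta-\theta^\star)^2\) with additive Gaussian gradient noise of standard deviation \(\sigma\), and hypothetical helper names `sgd_chain` and `stationary_variance`. In this special case the iterates form an AR(1) process whose stationary variance is \(\beta\sigma^2/(2\mu-\beta\mu^2)\), so the empirical variance of the tail iterates can be compared with a closed form.

```python
import math
import random


def sgd_chain(theta0, beta, mu, sigma, theta_star, n_steps, rng):
    """Constant step-size SGD on L(theta) = mu/2 * (theta - theta_star)^2.

    The stochastic gradient is the true gradient plus centered Gaussian
    noise (an illustrative assumption). Returns the list of iterates.
    """
    theta = theta0
    iterates = []
    for _ in range(n_steps):
        grad = mu * (theta - theta_star) + sigma * rng.gauss(0.0, 1.0)
        theta = theta - beta * grad
        iterates.append(theta)
    return iterates


def stationary_variance(beta, mu, sigma):
    """Closed-form stationary variance for this toy AR(1) model only."""
    return beta * sigma**2 / (2 * mu - beta * mu**2)


rng = random.Random(0)
beta, mu, sigma, theta_star = 0.1, 1.0, 1.0, 2.0

iterates = sgd_chain(10.0, beta, mu, sigma, theta_star, 200_000, rng)
tail = iterates[1000:]  # drop a burn-in so the chain is close to stationarity

# Polyak-Ruppert style average over the tail sequence.
tail_average = math.fsum(tail) / len(tail)
# Empirical variance of the tail iterates around their mean.
empirical_var = math.fsum((x - tail_average) ** 2 for x in tail) / len(tail)

print("empirical variance:", empirical_var)
print("closed-form variance:", stationary_variance(beta, mu, sigma))
print("tail average:", tail_average, "target:", theta_star)
```

In this toy model the individual iterates keep fluctuating around \(\theta^\star\) at a scale set by the step size, while the tail average lands much closer to \(\theta^\star\). Shrinking `beta` reduces the stationary variance roughly in proportion.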

### Formula Summary

- **Objective Function**:
  \[ \min_{\theta \in \mathbb{R}^d} L(\theta) := \mathbb{E}_\zeta[\ell(\theta, \zeta)] \]
- **SGD Iteration**:
  \[ \theta_{t + 1}=\theta_t-\beta G(\theta_t,\zeta_t),\quad t\geq0 \]
- **Gradient Error Assumption**:
  \[ G(\theta,\zeta)=\nabla L(\theta)+\epsilon_\zeta(\theta) \]
  where \(\epsilon_\zeta(\theta)\) is the centered noise, satisfying:
  \[ \mathbb{E}[\epsilon_\zeta(\theta)\mid\theta]=0 \]
  and there exist positive numbers \(L_\sigma\) and \(\sigma^2\) such that:
  \[ \mathbb{E}[\|\epsilon_\zeta(\theta)\|^2\mid\theta]\leq L_\sigma\|\theta - \theta^\star\|^2+\sigma^2 \]
- **Sub-Gaussian and Sub-Exponential Concentration Assumptions**:
  - Sub-Gaussian Concentration Assumption:
    \[ \|\epsilon_\zeta(\theta)\| \in \tilde{\Psi}_2(K) \]
  - Sub-Exponential Concentration Assumption:
    \[ \|\epsilon_\zeta(\theta)\| \in \tilde{\Psi}_1(K) \]
- **Properties of the Invariant Distribution**:
  - Expectation:
    \[ \mathbb{E}_{\theta\sim\pi_\beta}[\nabla L(\theta)] = 0 \]
  - Variance and Bias Bound:
    \[ \text{Var}_{\pi_\beta}(\theta)\leq\frac{\beta\sigma^2}{2\mu-\beta(\mu^2 + L_\sigma)} \]
    \[ \|\bar{\theta}_\beta-\theta^\star\|\leq\sqrt{\
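
The expectation identity for the invariant distribution can be recovered in one step from invariance. The following is a sketch, assuming \(\pi_\beta\) has a finite first moment and that \(\zeta_0\) is drawn independently of \(\theta_0\). If \(\theta_0\sim\pi_\beta\), then \(\theta_1\sim\pi_\beta\) as well, so \(\mathbb{E}[\theta_1]=\mathbb{E}[\theta_0]\). Taking expectations in the SGD iteration and using the centered-noise assumption gives:

\[ \mathbb{E}[\theta_1]=\mathbb{E}[\theta_0]-\beta\,\mathbb{E}[\nabla L(\theta_0)]-\beta\,\mathbb{E}\big[\mathbb{E}[\epsilon_{\zeta_0}(\theta_0)\mid\theta_0]\big] \quad\Longrightarrow\quad \beta\,\mathbb{E}_{\theta\sim\pi_\beta}[\nabla L(\theta)]=0 \]

Since \(\beta>0\), the identity follows. As an illustration, if the gradient is linear, \(\nabla L(\theta)=A(\theta-\theta^\star)\) with \(A\) invertible (notation introduced here for the example), the same argument gives \(\mathbb{E}_{\theta\sim\pi_\beta}[\theta]=\theta^\star\).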

Convergence and concentration properties of constant step-size SGD through Markov chains

Convergence of Markov Chains for Constant Step-size Stochastic Gradient Descent with Separable Functions

Convergence Rates for Stochastic Approximation: Biased Noise with Unbounded Variance, and Applications

Convergence of Constant Step Stochastic Gradient Descent for Non-Smooth Non-Convex Functions

Convergence Analysis of Stochastic Gradient Descent with MCMC Estimators

High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise

Demystifying the Myths and Legends of Nonconvex Convergence of SGD

Stationary Behavior of Constant Stepsize SGD Type Algorithms: An Asymptotic Characterization

Convergence Analysis of Accelerated Stochastic Gradient Descent under the Growth Condition

Stochastic Gradient Descent Revisited

Convergence in High Probability of Distributed Stochastic Gradient Descent Algorithms

Tight Nonparametric Convergence Rates for Stochastic Gradient Descent under the Noiseless Linear Model

Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions

The Anytime Convergence of Stochastic Gradient Descent with Momentum: From a Continuous-Time Perspective

Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight

Stochastic Gradient Descent in Continuous Time: A Central Limit Theorem

High-probability Convergence Bounds for Nonlinear Stochastic Gradient Descent Under Heavy-tailed Noise

Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods

A Tight Convergence Analysis for Stochastic Gradient Descent with Delayed Updates

Nonasymptotic Analysis of Stochastic Gradient Descent with the Richardson-Romberg Extrapolation

Linear convergence of decentralized estimation for statistical estimation using gradient method