Convergence and concentration properties of constant step-size SGD through Markov chains

Ibrahim Merad, Stéphane Gaïffas
2023-07-04
Abstract: We consider the optimization of a smooth and strongly convex objective using constant step-size stochastic gradient descent (SGD) and study its properties through the prism of Markov chains. We show that, for unbiased gradient estimates with mildly controlled variance, the iteration converges to an invariant distribution in total variation distance. We also establish this convergence in Wasserstein-2 distance in a more general setting compared to previous work. Thanks to the invariance property of the limit distribution, our analysis shows that the latter inherits sub-Gaussian or sub-exponential concentration properties when these hold true for the gradient. This allows the derivation of high-confidence bounds for the final estimate. Finally, under such conditions in the linear case, we obtain a dimension-free deviation bound for the Polyak-Ruppert average of a tail sequence. All our results are non-asymptotic and their consequences are discussed through a few applications.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
This paper studies the convergence and concentration properties of constant step-size stochastic gradient descent (SGD) through Markov chain theory. Specifically, the authors consider the minimization of a smooth and strongly convex objective and analyze the SGD iterates, and the estimators built from them, as a Markov chain. The main contributions of the paper are:

1. **Convergence Results**:
   - The Markov chain of SGD iterates converges to an invariant distribution in total variation distance and in Wasserstein-2 distance. The Wasserstein-2 result holds under more general conditions than in previous work.
2. **Concentration Properties**:
   - Sub-Gaussian or sub-exponential concentration of the gradient samples transfers to the invariant distribution, which yields high-confidence deviation bounds for the final SGD iterate.
   - In the linear case, under stronger concentration assumptions, a dimension-free high-confidence deviation bound is derived.
3. **Polyak-Ruppert Averaging**:
   - For the Polyak-Ruppert average of a tail sequence, a dimension-free high-confidence deviation bound is obtained by applying a more general concentration result, valid for any Lipschitz function of a stationary sequence.
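The Markov chain viewpoint can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes a one-dimensional quadratic objective L(θ) = (μ/2)(θ − θ*)² with additive Gaussian gradient noise, and all names and parameter values (`run_sgd`, `stationary_variance`, `mu`, `sigma`, `beta`, `burn_in`) are hypothetical choices for illustration. In this toy model the iterates form an AR(1) process whose invariant law has variance βσ²/(2μ − βμ²), so the spread of the final iterate across many independent chains can be compared with that closed form, and with the spread of a tail average.

```python
import random
import statistics


def run_sgd(theta0, theta_star, mu, sigma, beta, n_steps, rng):
    """Constant step-size SGD on L(theta) = 0.5 * mu * (theta - theta_star)**2.

    The gradient estimate is the true gradient plus centered Gaussian noise
    with standard deviation sigma (an assumption of this toy model).
    """
    theta = theta0
    path = []
    for _ in range(n_steps):
        grad = mu * (theta - theta_star) + rng.gauss(0.0, sigma)
        theta -= beta * grad
        path.append(theta)
    return path


def stationary_variance(mu, sigma, beta):
    """Variance of the invariant law of the AR(1) chain in this toy model."""
    return beta * sigma**2 / (2.0 * mu - beta * mu**2)


rng = random.Random(0)
mu, sigma, beta, theta_star = 1.0, 1.0, 0.1, 2.0
n_chains, n_steps, burn_in = 2000, 300, 200

finals, tail_averages = [], []
for _ in range(n_chains):
    path = run_sgd(10.0, theta_star, mu, sigma, beta, n_steps, rng)
    finals.append(path[-1])                               # last iterate
    tail_averages.append(statistics.fmean(path[burn_in:]))  # average of the tail

empirical_var = statistics.pvariance(finals)
predicted_var = stationary_variance(mu, sigma, beta)
tail_avg_var = statistics.pvariance(tail_averages)

print(f"variance of last iterate (empirical): {empirical_var:.4f}")
print(f"variance of invariant law (closed form): {predicted_var:.4f}")
print(f"variance of tail average (empirical): {tail_avg_var:.4f}")
```

With a constant step size the last iterate does not converge to θ* but keeps fluctuating around it at a scale set by β, while averaging the tail of the trajectory reduces that spread. This only mirrors the qualitative picture summarized above; the paper's results are non-asymptotic bounds under far more general assumptions.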
### Formula Summary

- **Objective Function**:
  \[ \min_{\theta \in \mathbb{R}^d} L(\theta) := \mathbb{E}_\zeta[\ell(\theta, \zeta)] \]
- **SGD Iteration**:
  \[ \theta_{t + 1}=\theta_t-\beta G(\theta_t,\zeta_t),\quad t\geq0 \]
- **Gradient Error Assumption**:
  \[ G(\theta,\zeta)=\nabla L(\theta)+\epsilon_\zeta(\theta) \]
  where \(\epsilon_\zeta(\theta)\) is the centered noise, satisfying:
  \[ \mathbb{E}[\epsilon_\zeta(\theta)\mid\theta]=0 \]
  and there exist positive numbers \(L_\sigma\) and \(\sigma^2\) such that:
  \[ \mathbb{E}[\|\epsilon_\zeta(\theta)\|^2\mid\theta]\leq L_\sigma\|\theta - \theta^\star\|^2+\sigma^2 \]
- **Sub-Gaussian and Sub-Exponential Concentration Assumptions**:
  - Sub-Gaussian Concentration Assumption:
    \[ \|\epsilon_\zeta(\theta)\| \in \tilde{\Psi}_2(K) \]
  - Sub-Exponential Concentration Assumption:
    \[ \|\epsilon_\zeta(\theta)\| \in \tilde{\Psi}_1(K) \]
- **Properties of the Invariant Distribution**:
  - Expectation:
    \[ \mathbb{E}_{\theta\sim\pi_\beta}[\nabla L(\theta)] = 0 \]
  - Variance and Bias Bound:
    \[ \text{Var}_{\pi_\beta}(\theta)\leq\frac{\beta\sigma^2}{2\mu-\beta(\mu^2 + L_\sigma)} \]
    \[ \|\bar{\theta}_\beta-\theta^\star\|\leq\sqrt{\