Aiming towards the minimizers: fast convergence of SGD for overparametrized problems

Chaoyue Liu, Dmitriy Drusvyatskiy, Mikhail Belkin, Damek Davis, Yi-An Ma
2023-06-05
Abstract: Modern machine learning paradigms, such as deep learning, occur in or close to the interpolation regime, wherein the number of model parameters is much larger than the number of data samples. In this work, we propose a regularity condition within the interpolation regime which endows the stochastic gradient method with the same worst-case iteration complexity as the deterministic gradient method, while using only a single sampled gradient (or a minibatch) in each iteration. In contrast, all existing guarantees require the stochastic gradient method to take small steps, thereby resulting in a much slower linear rate of convergence. Finally, we demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
This paper addresses the convergence rate of stochastic gradient descent (SGD) on overparametrized problems. Specifically, the authors seek a condition under which SGD, in the interpolation regime, attains the same worst-case iteration complexity as deterministic gradient descent while using only a single sampled gradient (or a minibatch gradient) per iteration. Such a condition narrows the gap between theory and practice: existing guarantees require SGD to take small steps, which yields a much slower linear rate of convergence.

### Main contributions of the paper

1. **Introducing a new regularity condition**: The authors propose an "aiming condition" under which SGD converges at a fast linear rate of \( \mathcal{O}(\exp(-t / \kappa)) \), both in expectation and with high probability, even if the regularity conditions hold only locally. The condition requires the negative gradient to point toward the set of global minimizers \( S \), that is, \( -\nabla L(w) \) has a nontrivial correlation with the direction \( \text{proj}_S(w)-w \).
2. **Justifying the condition**: The authors prove that, when training a sufficiently wide feedforward neural network with a linear output layer, the condition holds on any compact region.
3. **Improving the convergence rate of SGD**: Using the new condition, the authors show how to choose a larger step size for nonconvex problems, so that the convergence speed of SGD is close to that of deterministic gradient descent.

### Summary of mathematical formulas

- **Quadratic Growth Condition (QG)**:
  \[ L(w)\geq\frac{\alpha}{2}\cdot\text{dist}^2(w, S),\quad\forall w\in B_r(w_0) \]
  where \( B_r(w_0) \) is the ball of radius \( r \) centered at the initial point \( w_0 \), and \( \text{dist}(w, S) \) is the distance from \( w \) to the set \( S \).
- **Aiming Condition (Aiming)**:
  \[ \langle\nabla L(w), w - \text{proj}_S(w)\rangle\geq\theta\cdot L(w),\quad\forall w\in B_r(w_0) \]
  where \( \text{proj}_S(w) \) denotes the point of \( S \) nearest to \( w \).
- **Theorem 1.1 (informal)**: Assume \( L(w)=\mathbb{E}_{z\sim P}[\ell(w, z)] \), where each loss \( \ell(\cdot, z) \) is nonnegative and has a \( \beta \)-Lipschitz continuous gradient. If the minimum value of \( L \) is zero and the regularity conditions (QG) and (Aiming) hold, then as long as the SGD iterates remain within \( B_r(w_0) \), they converge to \( S \) with high probability at a linear rate of \( \mathcal{O}(\exp(-t\alpha\theta^2 / \beta)) \).

### Experimental verification

The paper also verifies the proposed conditions and theoretical results experimentally. In particular, when training a fully connected neural network on the MNIST dataset, the behavior of SGD is consistent with the theoretical prediction.

### Summary

By introducing a new regularity condition, this paper improves the convergence guarantee for SGD on overparametrized problems and brings the theory closer to the behavior observed in practice. This provides theoretical support for understanding and tuning SGD in deep learning.
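The quadratic growth and aiming conditions summarized above can be checked numerically on a toy problem. The sketch below is not the paper's experiment: it assumes an overparametrized least-squares model (5 samples, 20 parameters) as a stand-in for a neural network, and all function names (`make_problem`, `row_space_basis`, `offset_from_S`, `run_sgd`) are hypothetical helpers introduced only for illustration. For this model the solution set \( S \) is an affine subspace, so \( w - \text{proj}_S(w) \) is the component of \( w - w^\ast \) in the row space of the data matrix, and the ratio \( \langle\nabla L(w), w - \text{proj}_S(w)\rangle / L(w) \) equals 2 analytically. The step size \( 1/\beta \), with \( \beta \) the largest squared row norm, is an illustrative choice rather than the constant prescribed by the paper.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def make_problem(n=5, d=20, seed=0):
    """Random interpolating least-squares problem with d > n (assumed toy setup)."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    w_star = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = [dot(row, w_star) for row in A]  # labels are fit exactly, so min L = 0
    return A, b, w_star


def loss(A, b, w):
    # L(w) = (1/n) * sum_i 0.5 * (<a_i, w> - b_i)^2
    return sum(0.5 * (dot(row, w) - bi) ** 2 for row, bi in zip(A, b)) / len(A)


def grad(A, b, w):
    # Full gradient: (1/n) * sum_i (<a_i, w> - b_i) * a_i
    n = len(A)
    g = [0.0] * len(w)
    for row, bi in zip(A, b):
        r = dot(row, w) - bi
        g = [gj + r * aj / n for gj, aj in zip(g, row)]
    return g


def row_space_basis(A):
    # Gram-Schmidt orthonormalization of the rows of A
    basis = []
    for row in A:
        v = list(row)
        for q in basis:
            c = dot(q, v)
            v = [vj - c * qj for vj, qj in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > 1e-12:
            basis.append([vj / norm for vj in v])
    return basis


def offset_from_S(basis, w, w_star):
    # w - proj_S(w): the row-space component of (w - w_star)
    diff = [a - c for a, c in zip(w, w_star)]
    out = [0.0] * len(w)
    for q in basis:
        c = dot(q, diff)
        out = [o + c * qj for o, qj in zip(out, q)]
    return out


def run_sgd(A, b, w0, step, iters, seed=1):
    # Single-sample SGD: w <- w - step * grad of 0.5 * (<a_i, w> - b_i)^2
    rng = random.Random(seed)
    w = list(w0)
    for _ in range(iters):
        i = rng.randrange(len(A))
        r = dot(A[i], w) - b[i]
        w = [wj - step * r * aj for wj, aj in zip(w, A[i])]
    return w


if __name__ == "__main__":
    A, b, w_star = make_problem()
    basis = row_space_basis(A)
    w0 = [0.0] * len(w_star)

    off = offset_from_S(basis, w0, w_star)
    dist_sq = dot(off, off)
    print("aiming ratio <grad L, w - proj_S(w)> / L(w):",
          dot(grad(A, b, w0), off) / loss(A, b, w0))
    print("quadratic growth ratio 2 L(w) / dist^2(w, S):",
          2.0 * loss(A, b, w0) / dist_sq)

    beta = max(dot(row, row) for row in A)  # smoothness of the worst sample loss
    w = run_sgd(A, b, w0, step=1.0 / beta, iters=5000)
    print("loss before / after SGD:", loss(A, b, w0), loss(A, b, w))
```

Because the toy loss is a convex quadratic, both conditions hold globally here; the paper's interest is in nonconvex losses where they hold only on a ball around the initialization, which this sketch does not attempt to reproduce.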