Abstract: Modern machine learning paradigms, such as deep learning, occur in or close to the interpolation regime, wherein the number of model parameters is much larger than the number of data samples. In this work, we propose a regularity condition within the interpolation regime which endows the stochastic gradient method with the same worst-case iteration complexity as the deterministic gradient method, while using only a single sampled gradient (or a minibatch) in each iteration. In contrast, all existing guarantees require the stochastic gradient method to take small steps, thereby resulting in a much slower linear rate of convergence. Finally, we demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.

What problem does this paper attempt to address?

This paper addresses the convergence rate of Stochastic Gradient Descent (SGD) on overparametrized problems. Specifically, the authors seek a condition under which SGD converges within the interpolation regime with the same worst-case iteration complexity as deterministic gradient descent, while using only a single sampled gradient (or a minibatch gradient) per iteration. Such a condition narrows the gap between theory and practice, because existing guarantees require SGD to take small steps, which results in a much slower linear convergence rate.

### Main contributions of the paper

1. **Introducing a new regularity condition**: The authors propose an "Aiming Condition" under which SGD converges at a fast linear rate of \( \mathcal{O}(\exp(-t / \kappa)) \), in expectation and with high probability, even if the regularity conditions hold only locally. The condition ensures that the negative gradient points toward the set of global minimizers \( S \); that is, \( -\nabla L(w) \) has a nontrivial correlation with the direction \( \text{proj}_S(w)-w \).
2. **Justifying the condition**: The authors prove that, when training a sufficiently wide feedforward neural network with a linear output layer, the condition holds on any compact region.
3. **Improving the convergence rate of SGD**: Under the new condition, the authors show that SGD can take large steps on nonconvex problems, so that its convergence speed is close to that of deterministic gradient descent.

### Summary of mathematical formulas

- **Quadratic Growth Condition (QG)**:
  \[ L(w)\geq\frac{\alpha}{2}\cdot\text{dist}^2(w, S),\quad\forall w\in B_r(w_0) \]
  where \( B_r(w_0) \) is the ball of radius \( r \) centered at the initial point \( w_0 \).
- **Aiming Condition (Aiming)**:
  \[ \langle\nabla L(w), w - \text{proj}_S(w)\rangle\geq\theta\cdot L(w),\quad\forall w\in B_r(w_0) \]
  where \( \text{proj}_S(w) \) denotes the point of \( S \) nearest to \( w \).
- **Theorem 1.1 (informal)**: Assume \( L(w)=\mathbb{E}_{z\sim P}[\ell(w, z)] \), where each loss \( \ell(\cdot, z) \) is nonnegative and has a \( \beta \)-Lipschitz continuous gradient. If the minimum value of \( L \) is zero and the regularity conditions (QG) and (Aiming) hold, then as long as the SGD iterates remain within \( B_r(w_0) \), they converge to \( S \) with high probability at a linear rate of \( \mathcal{O}(\exp(-t\alpha\theta^2 / \beta)) \).

### Experimental verification

The paper also verifies the proposed conditions and theoretical results experimentally. In particular, when training a fully connected neural network on the MNIST dataset, the observed behavior of SGD is consistent with the theoretical prediction.

### Summary

By introducing a new regularity condition, this paper improves the guaranteed convergence rate of SGD on overparametrized problems and brings the theory closer to the behavior observed in practice. This provides theoretical support for understanding and tuning SGD in deep learning.
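The quadratic growth and aiming conditions can be checked numerically on a toy interpolation problem. The following is a minimal sketch, not the paper's experiment: it assumes an overparametrized least-squares loss \( \ell_i(w)=\tfrac{1}{2}(a_i^\top w-b_i)^2 \) with fewer samples than parameters, for which \( S=\{w : Aw=b\} \) and the projection onto \( S \) has a closed form via the pseudoinverse. All function names (`make_problem`, `aiming_ratio`, `qg_ratio`, `sgd`) are hypothetical, and the constant step \( 1/\beta \) with \( \beta=\max_i\|a_i\|^2 \) is a choice made for this illustration rather than a value taken from the paper.

```python
import numpy as np

def make_problem(n=20, d=100, seed=0):
    """Random least-squares instance with n < d, so the data can be interpolated."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    w_star = rng.standard_normal(d)
    return A, A @ w_star  # b = A w_star, hence min L = 0

def loss(A, b, w):
    """L(w) = (1/n) * sum_i 0.5 * (a_i . w - b_i)^2."""
    return 0.5 * np.mean((A @ w - b) ** 2)

def grad(A, b, w):
    """Full gradient of L."""
    return A.T @ (A @ w - b) / A.shape[0]

def proj_S(A, b, P, w):
    """Nearest point to w in S = {w : A w = b}; P is the pseudoinverse of A."""
    return w - P @ (A @ w - b)

def aiming_ratio(A, b, P, w):
    """<grad L(w), w - proj_S(w)> / L(w); (Aiming) asks this to be >= theta."""
    return grad(A, b, w) @ (w - proj_S(A, b, P, w)) / loss(A, b, w)

def qg_ratio(A, b, P, w):
    """2 L(w) / dist^2(w, S); (QG) asks this to be >= alpha."""
    dist2 = np.sum((w - proj_S(A, b, P, w)) ** 2)
    return 2.0 * loss(A, b, w) / dist2

def sgd(A, b, P, w0, step, iters, seed=1):
    """Single-sample SGD with a constant step; records dist(w_t, S) at every iterate."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    dists = [np.linalg.norm(w - proj_S(A, b, P, w))]
    for _ in range(iters):
        i = rng.integers(A.shape[0])            # sample one data point
        w = w - step * (A[i] @ w - b[i]) * A[i]  # gradient of the sampled loss
        dists.append(np.linalg.norm(w - proj_S(A, b, P, w)))
    return w, np.array(dists)

if __name__ == "__main__":
    A, b = make_problem()
    P = np.linalg.pinv(A)
    w0 = np.random.default_rng(2).standard_normal(A.shape[1])
    beta = np.max(np.sum(A ** 2, axis=1))  # smoothness constant of the sampled losses
    _, dists = sgd(A, b, P, w0, step=1.0 / beta, iters=4000)
    print("aiming ratio at w0:", aiming_ratio(A, b, P, w0))
    print("QG ratio at w0:    ", qg_ratio(A, b, P, w0))
    print("dist to S, first and last:", dists[0], dists[-1])
```

On this convex instance the aiming ratio equals 2 at every point outside \( S \) (because \( AA^{+}=I \) when \( A \) has full row rank), and the QG ratio is bounded below by \( \sigma_{\min}(A)^2/n \). The neural-network setting treated in the paper is nonconvex and only satisfies the conditions locally, so this sketch illustrates the definitions rather than reproducing the paper's result.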

Aiming towards the minimizers: fast convergence of SGD for overparametrized problems

The Optimality of (Accelerated) SGD for High-Dimensional Quadratic Optimization

Demystifying SGD with Doubly Stochastic Gradients

Faster Convergence of Local SGD for Over-Parameterized Models

Shuffling Gradient Descent-Ascent with Variance Reduction for Nonconvex-Strongly Concave Smooth Minimax Problems

Optimal Adaptive and Accelerated Stochastic Gradient Descent

Accelerated zero-order SGD under high-order smoothness and overparameterized regime

On Convergence of Incremental Gradient for Non-Convex Smooth Functions

Convergence of Constant Step Stochastic Gradient Descent for Non-Smooth Non-Convex Functions

Derivatives of Stochastic Gradient Descent in parametric optimization

Asynchronous Accelerated Stochastic Gradient Descent

Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm

A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization

Diagonalisation SGD: Fast & Convergent SGD for Non-Differentiable Models via Reparameterisation and Smoothing

Learning Algorithm Hyperparameters for Fast Parametric Convex Optimization

Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth

Towards Noise-adaptive, Problem-adaptive (Accelerated) Stochastic Gradient Descent

An Alternative View: When Does SGD Escape Local Minima?

Enhancing Stochastic Gradient Descent: A Unified Framework and Novel Acceleration Methods for Faster Convergence

A Single-Loop Smoothed Gradient Descent-Ascent Algorithm for Nonconvex-Concave Min-Max Problems