Abstract: Stochastic gradient descent (SGD) has been a go-to algorithm for nonconvex stochastic optimization problems arising in machine learning. Its theory, however, often requires a strong framework to guarantee convergence properties. We hereby present a full-scope convergence study of biased nonconvex SGD, including weak convergence, function-value convergence and global convergence, and also provide subsequent convergence rates and complexities, all under relatively mild conditions in comparison with the literature.

What problem does this paper attempt to address?

The problem this paper addresses is the convergence analysis of Stochastic Gradient Descent (SGD) for non-convex stochastic optimization problems. Specifically, it re-examines SGD and, under relatively mild conditions, carries out a comprehensive convergence study of biased non-convex SGD, covering weak convergence, function-value convergence and global convergence, together with the corresponding convergence rates and complexities.

### Main problem decomposition

1. **Convergence of non-convex optimization problems**:
   - Traditional SGD theory usually requires a strong framework to ensure convergence, especially in non-convex cases.
   - The paper aims to provide a broader convergence analysis framework applicable to non-convex and biased SGD.
2. **Different convergence modes**:
   - **Weak convergence**: the gradient \(\nabla F(\theta_n)\) converges to zero almost surely.
   - **Function-value convergence**: the objective value \(F(\theta_n)\) converges almost surely.
   - **Global convergence**: the iterate \(\theta_n\) converges to a stationary point almost surely.
3. **Convergence rate and complexity**:
   - Study the convergence rate under each convergence mode and derive the complexity needed to meet a given tolerance.
4. **Application of the local Łojasiewicz condition**:
   - Use the local Łojasiewicz condition to describe the behavior of non-convex loss functions near a stationary point, and derive sharper convergence rates from it.

### Formula representation

- **Local Łojasiewicz condition**:
  \[ |F(\theta) - F(\theta^\star)|^\beta \leq \zeta \|\nabla F(\theta)\|, \quad \theta \in V \]
  where \(\theta^\star\) is a stationary point, \(V\) is an open neighborhood of \(\theta^\star\), \(\beta \in (0, 1)\) and \(\zeta > 0\).
- **Function-value convergence**:
  \[ F(\theta_n) - F^\star \leq C_\delta \left( \sum_{k = 1}^n \gamma_k \right)^{-\frac{1}{2\beta}} \quad \text{with probability at least } 1 - \delta \]
  where \(\gamma_k\) is the learning rate sequence and \(\beta\) is the Łojasiewicz exponent.

### Conclusion

By introducing new assumptions and methods, this paper overcomes some limitations of traditional SGD theory in non-convex optimization problems. In particular, without relying on the assumption that the iterates are bounded, it provides broader and more practical convergence results. This provides an important theoretical basis for understanding and improving optimization algorithms in deep learning.
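
To make the local Łojasiewicz condition and the biased SGD setting above more concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than the paper's setup: the one-dimensional objective \(F(\theta) = \theta^2 + 3\sin^2\theta\) (non-convex, with its only stationary point at \(\theta^\star = 0\)), the bias \(1/n\), the Gaussian noise, the step sizes \(\gamma_n = 0.1 / n^{0.75}\), and the helper names `lojasiewicz_holds` and `biased_sgd`. It checks the Łojasiewicz inequality numerically on \(V = (-1, 1)\) with \(\beta = 1/2\) and \(\zeta = 1/2\), then runs SGD with a biased gradient estimate.

```python
import math
import random


def F(theta):
    """Illustrative non-convex objective; its only stationary point is theta = 0."""
    return theta ** 2 + 3.0 * math.sin(theta) ** 2


def grad_F(theta):
    """Exact derivative of F: 2*theta + 3*sin(2*theta)."""
    return 2.0 * theta + 3.0 * math.sin(2.0 * theta)


def lojasiewicz_holds(zeta, beta, radius, grid=2001):
    """Check |F(theta) - F(0)|**beta <= zeta * |grad_F(theta)| on a grid over [-radius, radius]."""
    for i in range(grid):
        theta = -radius + 2.0 * radius * i / (grid - 1)
        lhs = abs(F(theta) - F(0.0)) ** beta
        rhs = zeta * abs(grad_F(theta))
        if lhs > rhs + 1e-12:  # small tolerance for floating-point rounding
            return False
    return True


def biased_sgd(theta0, n_steps, gamma0=0.1, decay=0.75, noise_std=1.0, seed=0):
    """SGD with a biased, noisy gradient estimate (all modelling choices are assumptions)."""
    rng = random.Random(seed)
    theta = theta0
    for n in range(1, n_steps + 1):
        gamma = gamma0 / n ** decay          # decreasing step size gamma_n
        bias = 1.0 / n                       # hypothetical vanishing bias
        noise = rng.gauss(0.0, noise_std)    # zero-mean gradient noise
        theta -= gamma * (grad_F(theta) + bias + noise)
    return theta


theta_final = biased_sgd(theta0=2.0, n_steps=20000)
print("Lojasiewicz check on (-1, 1):", lojasiewicz_holds(zeta=0.5, beta=0.5, radius=1.0))
print("final iterate:", theta_final)
print("final |grad F|:", abs(grad_F(theta_final)))
```

In this toy run the iterate should end up close to the stationary point, so both the gradient and the function-value gap are small, which is the qualitative behavior the convergence modes above describe. The sketch does not verify the paper's rates or constants.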

Stochastic Gradient Descent Revisited

Stochastic Gradient Descent with Biased but Consistent Gradient Estimators

Convergence Rates for Stochastic Approximation: Biased Noise with Unbounded Variance, and Applications

Non-asymptotic Analysis of Biased Adaptive Stochastic Approximation

Stochastic Gradient Descent in the Viewpoint of Graduated Optimization

Demystifying the Myths and Legends of Nonconvex Convergence of SGD

A new non-convex framework to improve asymptotical knowledge on generic stochastic gradient descent

Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods

The convergence of the Stochastic Gradient Descent (SGD) : a self-contained proof

An Algebraically Converging Stochastic Gradient Descent Algorithm for Global Optimization

Stochastic Gradient Descent in Continuous Time: A Central Limit Theorem

Convergence and concentration properties of constant step-size SGD through Markov chains

Almost Sure Convergence of Randomised‐difference Descent Algorithm for Stochastic Convex Optimisation

Stochastic Subgradient Descent Escapes Active Strict Saddles on Weakly Convex Functions

Beyond Convexity: Stochastic Quasi-Convex Optimization

Towards Noise-adaptive, Problem-adaptive (Accelerated) Stochastic Gradient Descent

High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise

Derivatives of Stochastic Gradient Descent in parametric optimization

Guided Stochastic Gradient Descent Algorithm for inconsistent datasets

On the Convergence and Improvement of Stochastic Normalized Gradient Descent

Universal Stagewise Learning for Non-Convex Problems with Convergence on Averaged Solutions