Stochastic Gradient Descent Revisited

Azar Louzi
2024-12-09
Abstract: Stochastic gradient descent (SGD) has been a go-to algorithm for nonconvex stochastic optimization problems arising in machine learning. Its theory, however, often requires a strong framework to guarantee convergence properties. We hereby present a full-scope convergence study of biased nonconvex SGD, including weak convergence, function-value convergence and global convergence, and also provide subsequent convergence rates and complexities, all under relatively mild conditions in comparison with the literature.
Optimization and Control, Probability, Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is the convergence analysis of stochastic gradient descent (SGD) for nonconvex stochastic optimization. Specifically, the author re-examines SGD and, under relatively mild conditions, carries out a comprehensive convergence study of biased nonconvex SGD, covering weak convergence, function-value convergence and global convergence, together with the corresponding convergence rates and complexities.

### Main problem decomposition

1. **Convergence in nonconvex optimization**:
   - Traditional SGD theory usually requires a strong framework to guarantee convergence, especially in the nonconvex case.
   - The paper aims to provide a broader convergence analysis framework that applies to nonconvex and biased SGD.
2. **Different convergence modes**:
   - **Weak convergence**: the gradient \(\nabla F(\theta_n)\) converges to zero almost surely.
   - **Function-value convergence**: the objective value \(F(\theta_n)\) converges almost surely.
   - **Global convergence**: the iterate \(\theta_n\) converges almost surely to a stationary point.
3. **Convergence rate and complexity**:
   - Study the convergence rate under each convergence mode and derive the optimal complexity needed to meet a given tolerance.
4. **Application of the local Łojasiewicz condition**:
   - Use the local Łojasiewicz condition to describe the behavior of nonconvex loss functions, and thereby derive sharper convergence rates.

### Formula representation

- **Local Łojasiewicz condition**:
  \[ |F(\theta) - F(\theta^\star)|^\beta \leq \zeta \|\nabla F(\theta)\|, \quad \theta \in V, \]
  where \(\theta^\star\) is a stationary point, \(V\) is an open neighborhood of \(\theta^\star\), \(\beta \in (0, 1)\) and \(\zeta > 0\).
- **Function-value convergence**:
  \[ F(\theta_n) - F^\star \leq C_\delta \left( \sum_{k = 1}^n \gamma_k \right)^{-\frac{1}{2\beta}} \quad \text{with probability at least } 1 - \delta, \]
  where \((\gamma_k)\) is the learning rate sequence and \(\beta\) is the Łojasiewicz exponent.

### Conclusion

By introducing new assumptions and methods, this paper overcomes some limitations of traditional SGD theory for nonconvex optimization problems. In particular, it does not rely on the assumption that the iterates remain bounded, which yields broader and more practical convergence results. This provides a theoretical basis for understanding and improving optimization algorithms in deep learning.
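
To make the setting summarized above more concrete, here is a minimal Python sketch of biased SGD on a one-dimensional nonconvex objective, together with a numerical check of the local Łojasiewicz inequality. Everything in it is an illustrative assumption rather than the paper's construction: the double-well objective `F`, the step sizes `gamma0 / k**0.75`, the vanishing bias `1 / k`, the Gaussian noise, and the function names `biased_sgd` and `lojasiewicz_holds`.

```python
import numpy as np


def F(theta):
    # Hypothetical nonconvex objective: a double well with
    # stationary points at -1, 0 and 1 (not taken from the paper).
    return 0.25 * (theta**2 - 1.0) ** 2


def grad_F(theta):
    # Exact gradient of F.
    return theta * (theta**2 - 1.0)


def biased_sgd(theta0, n_steps, gamma0=0.2, noise_std=0.5, seed=0):
    """Run SGD with a noisy, biased gradient oracle (all choices assumed)."""
    rng = np.random.default_rng(seed)
    theta = float(theta0)
    for k in range(1, n_steps + 1):
        gamma_k = gamma0 / k**0.75   # assumed decreasing learning rate
        bias_k = 1.0 / k             # assumed bias that vanishes with k
        noise = noise_std * rng.standard_normal()
        g = grad_F(theta) + bias_k + noise  # biased stochastic gradient
        theta -= gamma_k * g
    return theta


def lojasiewicz_holds(theta_star, beta, zeta, radius=0.2, n_grid=401):
    """Check |F(t) - F(t*)|^beta <= zeta * |F'(t)| on a grid around t*."""
    thetas = np.linspace(theta_star - radius, theta_star + radius, n_grid)
    lhs = np.abs(F(thetas) - F(theta_star)) ** beta
    rhs = zeta * np.abs(grad_F(thetas))
    return bool(np.all(lhs <= rhs + 1e-12))


if __name__ == "__main__":
    theta_n = biased_sgd(theta0=1.5, n_steps=5000)
    print("final iterate:", theta_n)
    print("gradient at final iterate:", grad_F(theta_n))
    print("Lojasiewicz (beta=0.5, zeta=1.0) near 1:",
          lojasiewicz_holds(1.0, beta=0.5, zeta=1.0))
```

For this particular objective the inequality near \(\theta^\star = 1\) can be verified by hand: \(|F(\theta)|^{1/2} = \tfrac{1}{2}|\theta^2 - 1|\) and \(|\nabla F(\theta)| = |\theta|\,|\theta^2 - 1|\), so the condition holds with \(\beta = 1/2\) whenever \(\zeta \geq 1/(2|\theta|)\) on the neighborhood. The sketch only illustrates the objects involved and does not reproduce the paper's rates or constants.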