Abstract: In large-scale learning algorithms, the momentum term is usually included in the stochastic sub-gradient method to improve the learning speed because it can navigate ravines efficiently to reach a local minimum. However, step-size and momentum weight hyper-parameters must be appropriately tuned to optimize convergence. We thus analyze the convergence rate using stochastic programming with Polyak's acceleration of two commonly used step-size learning rates: "diminishing-to-zero" and "constant-and-drop" (where the sequence is divided into stages and a constant step-size is applied at each stage) under strongly convex functions over a compact convex set with bounded sub-gradients. For the former, we show that the convergence rate can be written as a product of exponential in step-size and polynomial in momentum weight. Our analysis justifies the convergence of using the default momentum weight setting and the diminishing-to-zero step-size sequence in large-scale machine learning software. For the latter, we present the condition for the momentum weight sequence to converge at each stage.
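To make the setting in the abstract concrete, the following is a minimal sketch of a projected stochastic sub-gradient iteration with Polyak (heavy-ball) momentum, run under the two step-size schedules it names. The one-dimensional quadratic objective, the Gaussian noise model, the interval used as the compact convex set, and all function names and constants are assumptions chosen for illustration; the paper's exact update rule and constants may differ.

```python
import random

def diminishing_step(k, t0=0.5):
    """Diminishing-to-zero schedule t_k = t0 / (k + 1) (one illustrative choice)."""
    return t0 / (k + 1)

def constant_and_drop_step(k, t0=0.5, stage_len=100, drop=0.5):
    """Constant within each stage, multiplied by `drop` at every stage boundary."""
    return t0 * drop ** (k // stage_len)

def project_interval(x, radius):
    """Euclidean projection onto the compact convex set [-radius, radius]."""
    return max(-radius, min(radius, x))

def heavy_ball_sg(x_star, step_fn, eta=0.9, n_iters=2000,
                  radius=10.0, noise=0.1, seed=0):
    """Minimize f(x) = 0.5 * (x - x_star)**2 from noisy gradients.

    Assumed update (Polyak momentum with projection):
        x_{k+1} = P(x_k - t_k * g_k + eta * (x_k - x_{k-1}))
    """
    rng = random.Random(seed)
    x_prev = x = 5.0  # arbitrary starting point inside the feasible interval
    for k in range(n_iters):
        # Noisy gradient of the strongly convex quadratic.
        grad = (x - x_star) + noise * rng.gauss(0.0, 1.0)
        x_next = project_interval(
            x - step_fn(k) * grad + eta * (x - x_prev), radius
        )
        x_prev, x = x, x_next
    return x

if __name__ == "__main__":
    for name, schedule in [("diminishing-to-zero", diminishing_step),
                           ("constant-and-drop", constant_and_drop_step)]:
        print(name, heavy_ball_sg(1.0, schedule))
```

Both schedules are passed in as plain functions of the iteration index, so other step-size sequences can be tried without touching the update loop.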

What problem does this paper attempt to address?

The problem this paper addresses is how to optimize the convergence rate of the stochastic gradient method with momentum (SGM) in large-scale machine-learning algorithms. Specifically, the paper focuses on how two hyper-parameters, step-size and momentum weight, affect the convergence rate of SGM. Under the assumptions of strongly convex functions and bounded sub-gradients, the paper analyzes the convergence rates under two common step-size sequences: "diminishing-to-zero" and "constant-and-drop".

### Main Contributions:

1. **Diminishing-to-zero step-size sequence**:
   - The paper shows that, with the diminishing-to-zero step-size sequence, the mean-square-error convergence rate of SGM can be expressed as a function of independent hyper-parameter sequences of step-size and momentum weight.
   - Specifically, for the diminishing-to-zero step-size sequence, the convergence rate is \( O\left(e^{-\left(m \sum_{i = 1}^{N - 1}t_i+\frac{m^2}{2}\sum_{j = 1}^{N - 1}t_j^2\right)(1 + \sum_{i = 1}^{N}\eta_i)}\right) \), where \( t_i \) denotes the step-sizes and \( \eta_i \) the momentum weights.
   - This result indicates that in the worst-case scenario the convergence rate of SGM is the same as that of the stochastic gradient method without momentum (SG), but in practical applications momentum can help avoid getting trapped in local minima, thus accelerating convergence.
2. **Constant-and-drop step-size sequence**:
   - For the constant-and-drop step-size sequence in the multi-stage learning strategy, the paper analyzes the conditions that the momentum-weight sequence in each stage must satisfy to ensure mean-square-error convergence.
   - Specifically, if the momentum-weight sequence in each stage satisfies \( \sum_{i = 0}^{j}\frac{\eta_i}{j + 1}\to 0 \) and \( \eta_i<1 \), then the mean-square error converges.
   - This result supports the strategy of keeping the momentum weight constant or gradually decreasing it, which is commonly used in practice.
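The stage-wise condition on the momentum weights can be checked numerically for a candidate sequence. The sketch below computes the running averages \( \frac{1}{j+1}\sum_{i=0}^{j}\eta_i \) and applies a finite-horizon proxy for "tends to zero"; the function names, the tolerance `tol`, and the example sequence \( \eta_i = 0.9/(i+1) \) are assumptions for illustration, and a finite check of this kind can only suggest, not prove, that the limit condition holds.

```python
def cesaro_means(etas):
    """Running averages (1 / (j + 1)) * sum_{i=0}^{j} eta_i for j = 0, 1, ..."""
    means, total = [], 0.0
    for j, eta in enumerate(etas):
        total += eta
        means.append(total / (j + 1))
    return means

def satisfies_stage_condition(etas, tol=0.05):
    """Finite-horizon proxy for the condition as stated in the summary.

    Requires every eta_i < 1 and the last running average to fall below `tol`.
    `etas` must be a non-empty sequence; `tol` is an arbitrary illustrative cutoff.
    """
    if not etas:
        raise ValueError("etas must be non-empty")
    return all(eta < 1 for eta in etas) and cesaro_means(etas)[-1] < tol

if __name__ == "__main__":
    # Hypothetical decreasing momentum-weight sequence within one stage.
    decaying = [0.9 / (i + 1) for i in range(1000)]
    print(cesaro_means(decaying)[-1])
    print(satisfies_stage_condition(decaying))
```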
### Significance:

- **Theoretical Support**: The paper provides a theoretical basis, explaining why in practical applications, even in the worst-case scenario, the convergence rate of SGM is the same as that of SG, but momentum still helps improve performance.
- **Practical Guidance**: By analyzing the convergence characteristics of different step-size and momentum-weight sequences, the paper provides guidance for hyper-parameter tuning in practical applications, reducing the time cost of parameter tuning.

### Related Work:

- The paper reviews research in related fields, including the convergence-rate analysis of SG and SGM under different assumptions, and the application of momentum methods in deep learning.
- It specifically mentions the difference between Polyak momentum and Nesterov momentum and discusses the performance of these methods in optimization problems.

### Conclusion:

- The paper reveals the convergence characteristics of SGM under different step-size and momentum-weight sequences through rigorous mathematical analysis.
- Although in the worst-case scenario the convergence rate of SGM is the same as that of SG, in practical applications the use of momentum can significantly improve performance, especially in the training of deep neural networks.

Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight

The Anytime Convergence of Stochastic Gradient Descent with Momentum: From a Continuous-Time Perspective

An Improved Analysis of Stochastic Gradient Descent with Momentum

Continuous Time Analysis of Momentum Methods

Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise

Exponential convergence rates for momentum stochastic gradient descent in the overparametrized setting

Convergence Analysis of Asynchronous Stochastic Recursive Gradient Algorithms

Combining Conjugate Gradient and Momentum for Unconstrained Stochastic Optimization With Applications to Machine Learning

Unified Convergence Analysis of Stochastic Momentum Methods for Convex and Non-convex Optimization

Convergence and Stability of the Stochastic Proximal Point Algorithm with Momentum

Convergence Analysis of Accelerated Stochastic Gradient Descent under the Growth Condition

Convergence of Gradient Algorithms for Nonconvex C1+α Cost Functions

Non-Convex Stochastic Composite Optimization with Polyak Momentum

Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality

Convergence of SGD with momentum in the nonconvex case: A time window-based analysis

Generalized Polyak Step Size for First Order Optimization with Momentum

High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise

Stochastic Polyak Step-sizes and Momentum: Convergence Guarantees and Practical Performance

Shuffling Momentum Gradient Algorithm for Convex Optimization

Accelerated Almost-Sure Convergence Rates for Nonconvex Stochastic Gradient Descent using Stochastic Learning Rates

Almost sure convergence rates of stochastic gradient methods under gradient domination