Abstract: In large-scale learning algorithms, the momentum term is usually included in the stochastic sub-gradient method to improve the learning speed because it can navigate ravines efficiently to reach a local minimum. However, step-size and momentum weight hyper-parameters must be appropriately tuned to optimize convergence. We thus analyze the convergence rate using stochastic programming with Polyak's acceleration of two commonly used step-size learning rates: "diminishing-to-zero" and "constant-and-drop" (where the sequence is divided into stages and a constant step-size is applied at each stage) under strongly convex functions over a compact convex set with bounded sub-gradients. For the former, we show that the convergence rate can be written as a product of exponential in step-size and polynomial in momentum weight. Our analysis justifies the convergence of using the default momentum weight setting and the diminishing-to-zero step-size sequence in large-scale machine learning software. For the latter, we present the condition for the momentum weight sequence to converge at each stage.
What problem does this paper attempt to address?
The paper asks how two hyper-parameters, the step-size and the momentum weight, affect the convergence rate of the stochastic sub-gradient method with momentum (SGM) in large-scale machine-learning algorithms. Assuming strongly convex functions and bounded sub-gradients, it analyzes the convergence rate under two common step-size sequences: "diminishing-to-zero" and "constant-and-drop".
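For orientation, one standard way to write a projected stochastic sub-gradient step with Polyak (heavy-ball) momentum is shown below. This is a sketch under assumptions, not necessarily the exact recursion analyzed in the paper: \( t_i \) and \( \eta_i \) are the step-size and momentum weight used in the summary, while \( g_i \) (a stochastic sub-gradient at \( x_i \)), \( X \) (the compact convex set), and \( \Pi_X \) (Euclidean projection onto \( X \)) are notation introduced here for illustration.

\[
x_{i+1} = \Pi_X\bigl( x_i - t_i\, g_i + \eta_i\,(x_i - x_{i-1}) \bigr)
\]

Setting \( \eta_i = 0 \) for all \( i \) recovers the plain projected stochastic sub-gradient step.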
### Main Contributions:
1. **Diminishing-to-zero step-size sequence**:
- The paper shows that with a diminishing-to-zero step-size sequence, the mean-square-error convergence rate of SGM can be expressed as a function of two independent hyper-parameter sequences: the step-sizes \( t_i \) and the momentum weights \( \eta_i \).
- Specifically, for the diminishing-to-zero step-size sequence, the convergence rate is \( O\left(e^{-\left(m \sum_{i = 1}^{N - 1}t_i+\frac{m^2}{2}\sum_{j = 1}^{N - 1}t_j^2\right)(1 + \sum_{i = 1}^{N}\eta_i)}\right) \).
- This result indicates that in the worst case the convergence rate of SGM is the same as that of plain SG (no momentum), but in practical applications momentum can help avoid getting trapped in local minima, thus accelerating convergence.
2. **Constant-and-drop step-size sequence**:
- For the constant-and-drop step-size sequence used in multi-stage learning strategies, the paper analyzes the conditions that the momentum-weight sequence in each stage must satisfy to ensure mean-square-error convergence.
- Specifically, if the momentum-weight sequence in each stage satisfies \( \sum_{i = 0}^{j}\frac{\eta_i}{j + 1}\to0 \) and \( \eta_i<1 \), then the mean-square error converges.
- This result supports the strategy, commonly used in practice, of keeping the momentum weight constant or gradually decreasing it.
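
The two step-size schedules and the per-stage momentum condition can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's experimental setup: the objective is a hypothetical one-dimensional strongly convex quadratic on a compact interval, the update is the usual heavy-ball form, and every name and default value (`sgm`, `diminishing_to_zero`, `constant_and_drop`, `cesaro_mean`, the momentum weight 0.9, the noise level) is chosen here for illustration only.

```python
import random


def project(x, radius):
    """Euclidean projection onto the compact interval [-radius, radius]."""
    return max(-radius, min(radius, x))


def sgm(step_sizes, momentum_weights, m=1.0, x_star=0.5, radius=2.0,
        noise=0.1, x0=2.0, seed=0):
    """Projected stochastic gradient with heavy-ball momentum (illustrative).

    Minimizes f(x) = (m / 2) * (x - x_star)**2, which is m-strongly convex,
    using gradients corrupted by Gaussian noise. Returns the final squared
    error (x_N - x_star)**2.
    """
    rng = random.Random(seed)
    x_prev = x = x0
    for t, eta in zip(step_sizes, momentum_weights):
        grad = m * (x - x_star) + rng.gauss(0.0, noise)  # noisy gradient
        x_next = project(x - t * grad + eta * (x - x_prev), radius)
        x_prev, x = x, x_next
    return (x - x_star) ** 2


def diminishing_to_zero(n, t0=1.0):
    """Step-sizes t_i = t0 / (i + 1), which shrink toward zero."""
    return [t0 / (i + 1) for i in range(n)]


def constant_and_drop(n_stages, stage_len, t0=0.5, drop=0.5):
    """Constant step-size within each stage, multiplied by `drop` per stage."""
    return [t0 * drop ** s for s in range(n_stages) for _ in range(stage_len)]


def cesaro_mean(seq):
    """Running average (1 / (j + 1)) * sum_{i=0}^{j} eta_i over the sequence."""
    return sum(seq) / len(seq)


if __name__ == "__main__":
    n = 2000
    # Diminishing-to-zero step-size with a constant momentum weight.
    err_dim = sgm(diminishing_to_zero(n), [0.9] * n)

    # Constant-and-drop: within each stage the momentum weights decay, so
    # their running average tends to zero and every weight stays below 1.
    stage_len, n_stages = 200, 5
    stage_eta = [0.9 / (i + 1) for i in range(stage_len)]
    err_drop = sgm(constant_and_drop(n_stages, stage_len),
                   stage_eta * n_stages)

    print("diminishing-to-zero squared error:", err_dim)
    print("constant-and-drop squared error:  ", err_drop)
    print("per-stage average momentum weight:", cesaro_mean(stage_eta))
```

Restarting the momentum sequence at each stage mirrors the per-stage condition in item 2; whether a given training framework restarts momentum this way is not something the summary states.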
### Significance:
- **Theoretical Support**: The paper provides a theoretical basis for practice: even though the worst-case convergence rate of SGM is the same as that of SG, momentum still helps improve performance.
- **Practical Guidance**: By analyzing the convergence characteristics of different step-size and momentum-weight sequences, the paper provides guidance for hyper-parameter tuning in practical applications, reducing the time cost of parameter tuning.
### Related Work:
- The paper reviews research in related fields, including the convergence-rate analysis of SG and SGM under different assumptions, and the application of momentum methods in deep learning.
- It specifically mentions the difference between Polyak momentum and Nesterov momentum and discusses the performance of these methods in optimization problems.
### Conclusion:
- The paper characterizes the convergence of SGM under different step-size and momentum-weight sequences through rigorous mathematical analysis.
- Although the worst-case convergence rate of SGM is the same as that of SG, in practical applications the use of momentum can significantly improve performance, especially in the training of deep neural networks.