$μ^2$-SGD: Stable Stochastic Optimization via a Double Momentum Mechanism

Kfir Y. Levy
2023-04-09
Abstract: We consider stochastic convex optimization problems where the objective is an expectation over smooth functions. For this setting we suggest a novel gradient estimate that combines two recent mechanisms that are related to the notion of momentum. Then, we design an SGD-style algorithm as well as an accelerated version that make use of this new estimator, and demonstrate the robustness of these new approaches to the choice of the learning rate. Concretely, we show that these approaches obtain the optimal convergence rates for both the noiseless and noisy cases with the same choice of fixed learning rate. Moreover, for the noisy case we show that these approaches achieve the same optimal bound for a very wide range of learning rates.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
### The problem the paper attempts to solve

This paper addresses robustness to the learning rate in Stochastic Convex Optimization (SCO). Specifically, when the objective is an expectation over smooth functions, it asks how to design a gradient estimate that makes the optimization algorithm robust to the choice of learning rate while still attaining the optimal convergence rate. To this end, the paper proposes an algorithm named µ²-SGD and an accelerated version, µ²-Extra SGD. Both improve on standard Stochastic Gradient Descent (SGD) by combining two momentum-related mechanisms.

### Main contributions

1. **New gradient estimate**
   - The paper proposes a gradient estimate whose squared error \( \|\epsilon_t\|^2 \) shrinks as the number of updates \( t \) grows, that is, \( \|\epsilon_t\|^2 \propto \frac{1}{t} \). By contrast, the standard SGD estimator typically has a fixed variance of \( O(1) \).
   - The estimate combines two momentum-related mechanisms: Anytime Averaging and a corrected momentum technique.
2. **Algorithm design**
   - Two algorithms are built on the new estimate: µ²-SGD and µ²-Extra SGD.
   - Both are robust to the choice of learning rate: with a single fixed learning rate, they attain the optimal convergence rate in both the noisy and noiseless cases.
3. **Convergence performance**
   - In the noiseless case, µ²-SGD attains a convergence rate of \( O\left(\frac{L}{T}\right) \) with the fixed learning rate \( \eta_{\text{offline}}=\frac{1}{8LT} \).
   - In the noisy case, µ²-SGD keeps the optimal rate \( O\left(\frac{L}{T}+\frac{\tilde{\sigma}}{\sqrt{T}}\right) \) across a very wide range of learning rates, \( \eta \in [\eta_{\text{noisy}}, \eta_{\text{offline}}] \), where \( \eta_{\text{offline}} / \eta_{\text{noisy}}\approx\left(\frac{\tilde{\sigma}}{L}\right)\sqrt{T} \).
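To make the µ²-SGD estimate concrete, here is a minimal NumPy sketch of an SGD-style loop that combines Anytime Averaging with a corrected-momentum gradient estimate. It is an illustration under stated assumptions, not the paper's reference implementation: the weights \( \alpha_t = t \) and \( \beta_t = 1/t \), the unconstrained domain (no projection step), the names `mu2_sgd` and `run_toy`, and the toy objective \( f(x; z) = \frac{1}{2}\|x - z\|^2 \) (which is 1-smooth) are all choices made for this sketch.

```python
import numpy as np


def mu2_sgd(grad, sample, x0, lr, num_steps):
    """Sketch: anytime averaging + corrected momentum (assumed form).

    grad(x, z) -> stochastic gradient of f(.; z) at x  (hypothetical signature)
    sample()   -> a fresh random sample z              (hypothetical signature)
    Assumed weights: alpha_t = t for averaging, beta_t = 1 / t for momentum.
    """
    w = np.array(x0, dtype=float)  # iterate that receives the gradient step
    x = w.copy()                   # weighted average of the w's (query point)
    d = grad(x, sample())          # d_1 is a plain stochastic gradient
    alpha_sum = 1.0                # alpha_1 + ... + alpha_t

    for t in range(1, num_steps):
        # Step on w, weighted by alpha_t = t, with one fixed learning rate.
        w = w - lr * t * d

        # Anytime averaging: x_{t+1} is the alpha-weighted mean of w_1..w_{t+1}.
        alpha_next = t + 1.0
        x_next = (alpha_sum * x + alpha_next * w) / (alpha_sum + alpha_next)
        alpha_sum += alpha_next

        # Corrected momentum: evaluate the SAME sample z at the new and the
        # previous query point, then correct the running estimate d.
        z = sample()
        beta = 1.0 / alpha_next
        d = grad(x_next, z) + (1.0 - beta) * (d - grad(x, z))
        x = x_next

    return x


def run_toy(num_steps=2000, dim=5, noise=0.1, seed=0):
    """Toy problem (assumption): f(x; z) = 0.5 * ||x - z||^2, z ~ N(x_star, noise^2 I)."""
    rng = np.random.default_rng(seed)
    x_star = np.ones(dim)
    smoothness = 1.0  # L for the toy objective

    def sample():
        return x_star + noise * rng.standard_normal(dim)

    def grad(x, z):
        return x - z

    def excess_loss(x):
        # F(x) - F(x_star) for this toy objective.
        return 0.5 * float(np.sum((x - x_star) ** 2))

    x0 = np.zeros(dim)
    lr = 1.0 / (8.0 * smoothness * num_steps)  # eta_offline = 1 / (8 L T)
    x_out = mu2_sgd(grad, sample, x0, lr, num_steps)
    return excess_loss(x0), excess_loss(x_out)
```

Calling `run_toy()` returns the excess loss at the starting point and at the returned average. With the fixed step size \( \frac{1}{8LT} \), the second value should be far smaller than the first, even though the samples are noisy.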
   - µ²-Extra SGD attains the optimal convergence rate of \( O\left(\frac{L}{T^2}\right) \) in the noiseless case and \( O\left(\frac{L}{T^2}+\frac{\tilde{\sigma}}{\sqrt{T}}\right) \) in the noisy case, and keeps that rate across an extremely wide range of learning rates, \( \eta \in [\eta_{\text{noisy}}, \eta_{\text{offline}}] \), where \( \eta_{\text{offline}} / \eta_{\text{noisy}}\approx\left(\frac{\tilde{\sigma}}{L}\right)T^{3 / 2} \).

### Related work

- **Gradient Descent (GD) and Stochastic Gradient Descent (SGD)**: These are cornerstones of machine learning and optimization and are used across a wide variety of problems.
- **Adaptive learning rate methods**: Methods such as AdaGrad and Adam improve performance by implicitly adjusting the learning rate during training.
- **Momentum methods**: Such as Polyak's heavy-ball method and