Abstract: We consider stochastic convex optimization problems where the objective is an expectation over smooth functions. For this setting we suggest a novel gradient estimate that combines two recent mechanisms that are related to the notion of momentum. Then, we design an SGD-style algorithm as well as an accelerated version that make use of this new estimator, and demonstrate the robustness of these new approaches to the choice of the learning rate. Concretely, we show that these approaches obtain the optimal convergence rates for both the noiseless and the noisy case with the same choice of fixed learning rate. Moreover, for the noisy case we show that these approaches achieve the same optimal bound for a very wide range of learning rates.
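The abstract does not spell out the estimator, so the following is only a minimal sketch of how an SGD-style loop could combine the two momentum-related mechanisms, under stated assumptions: the query point `x` is a weighted running average of the iterates `w` (anytime averaging), and the direction `d` is a corrected-momentum estimate that evaluates the same sample at the current and previous query points. The weights `alpha_t = t` and `beta_t = 1/alpha_t`, the omission of any projection step, the function name `mu2_sgd_sketch`, and the toy quadratic are all illustrative assumptions, not details taken from the text.

```python
import numpy as np


def mu2_sgd_sketch(grad, sample, w0, eta, T, rng):
    """SGD-style loop with anytime averaging and corrected momentum (sketch).

    grad(x, z) -> stochastic gradient at x for sample z (hypothetical interface)
    sample(rng) -> a fresh sample z
    Assumed weights: alpha_t = t, beta_t = 1 / alpha_t. No projection step.
    """
    w = np.asarray(w0, dtype=float).copy()
    x = w.copy()                      # x_1 = w_1
    alpha_sum = 1.0                   # alpha_1 = 1
    d = grad(x, sample(rng))          # d_1 is a plain stochastic gradient
    for t in range(1, T):
        w = w - eta * t * d           # weighted SGD-style step on the iterate w
        alpha_next = float(t + 1)
        alpha_sum += alpha_next
        x_prev = x
        # Anytime averaging: x is the alpha-weighted average of w_1..w_{t+1}.
        x = x + (alpha_next / alpha_sum) * (w - x)
        # Corrected momentum: reuse one sample z at both x and x_prev.
        z = sample(rng)
        beta = 1.0 / alpha_next
        d = grad(x, z) + (1.0 - beta) * (d - grad(x_prev, z))
    return x


# Toy problem (hypothetical): f(x) = 0.5 * sum(lam * x**2), so L = max(lam) = 1.
lam = np.array([1.0, 0.1])


def make_grad(noise_std):
    def grad(x, z):
        return lam * x + noise_std * z   # additive gradient noise driven by z
    return grad


def sample(rng):
    return rng.standard_normal(lam.shape)


def f(x):
    return 0.5 * float(np.sum(lam * x ** 2))


T = 200
eta = 1.0 / (8 * 1.0 * T)             # fixed rate of the form 1 / (8 L T), with L = 1
x0 = np.ones(2)
x_noiseless = mu2_sgd_sketch(make_grad(0.0), sample, x0, eta, T, np.random.default_rng(0))
x_noisy = mu2_sgd_sketch(make_grad(0.5), sample, x0, eta, T, np.random.default_rng(0))
print(f(x0), f(x_noiseless), f(x_noisy))
```

In this toy setup the same `eta` is used for the noiseless and the noisy run. With additive noise, the correction term makes the error of `d` a running average of the noise samples, which is one way to see why the estimation error can shrink as `t` grows.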

What problem does this paper attempt to address?

### The problems the paper attempts to solve

This paper addresses a problem in Stochastic Convex Optimization (SCO): when the objective is the expectation of smooth functions, how can a gradient estimator be designed so that the optimization algorithm is more robust to the choice of learning rate while still obtaining the optimal convergence rate? To this end, the paper proposes a new algorithm named µ²-SGD and its accelerated version, µ²-Extra SGD. These algorithms improve on the traditional Stochastic Gradient Descent (SGD) method by combining two momentum-related mechanisms.

### Main contributions

1. **New gradient estimation**
   - A new gradient estimator is proposed whose squared error \( \|\epsilon_t\|^2 \) decreases as the number of updates \( t \) increases, that is, \( \|\epsilon_t\|^2 \propto \frac{1}{t} \). This is in contrast to the standard SGD estimator, whose variance usually stays at a fixed level \( O(1) \).
   - The new estimator combines two momentum-related mechanisms: Anytime Averaging and a corrected momentum technique.
2. **Algorithm design**
   - Two algorithms based on the new gradient estimator are designed: µ²-SGD and µ²-Extra SGD.
   - With a fixed learning rate, these algorithms are robust to the choice of that rate and obtain the optimal convergence rate in both the noisy and the noiseless case.
3. **Convergence performance**
   - For the noiseless case, µ²-SGD reaches a convergence rate of \( O\left(\frac{L}{T}\right) \) using the fixed learning rate \( \eta_{\text{offline}}=\frac{1}{8L T} \).
   - For the noisy case, µ²-SGD maintains the optimal convergence rate \( O\left(\frac{L}{T}+\frac{\tilde{\sigma}}{\sqrt{T}}\right) \) within a very wide range of learning rates, that is, \( \eta \in [\eta_{\text{noisy}}, \eta_{\text{offline}}] \), where \( \eta_{\text{offline}} / \eta_{\text{noisy}}\approx\left(\frac{\tilde{\sigma}}{L}\right)\sqrt{T} \).
   - µ²-Extra SGD reaches an optimal convergence rate of \( O\left(\frac{L}{T^2}\right) \) in the noiseless case and \( O\left(\frac{L}{T^2}+\frac{\tilde{\sigma}}{\sqrt{T}}\right) \) in the noisy case, and maintains the same optimal rate within an extremely wide range of learning rates, that is, \( \eta \in [\eta_{\text{noisy}}, \eta_{\text{offline}}] \), where \( \eta_{\text{offline}} / \eta_{\text{noisy}}\approx\left(\frac{\tilde{\sigma}}{L}\right)T^{3 / 2} \).

### Related work

- **Gradient Descent (GD) and Stochastic Gradient Descent (SGD)**: These are cornerstones of machine learning and optimization and are widely used across many problems.
- **Adaptive learning rate methods**: Methods such as AdaGrad and Adam improve performance by implicitly adjusting the learning rate during training.
- **Momentum methods**: Such as Polyak's heavy-ball method and
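To get a feel for how wide the learning-rate ranges quoted under the convergence results are, the short sketch below evaluates the two approximate ratios \( \eta_{\text{offline}} / \eta_{\text{noisy}} \) for example values. The function name `lr_range_ratio` and the chosen values of \( \tilde{\sigma}/L \) and \( T \) are illustrative assumptions; the formulas hold only up to constants, so the outputs are orders of magnitude rather than exact bounds.

```python
import math


def lr_range_ratio(sigma_over_L, T, accelerated=False):
    """Approximate eta_offline / eta_noisy, up to constant factors.

    Non-accelerated variant: (sigma / L) * sqrt(T)
    Accelerated variant:     (sigma / L) * T**(3/2)
    """
    root_T = math.sqrt(T)
    if accelerated:
        return sigma_over_L * T * root_T
    return sigma_over_L * root_T


# Example values (hypothetical): sigma / L = 1 and T = 10,000 updates.
print(lr_range_ratio(1.0, 10_000))                     # prints 100.0
print(lr_range_ratio(1.0, 10_000, accelerated=True))   # prints 1000000.0
```

Under these example values, the admissible learning rates span roughly two orders of magnitude for the non-accelerated method and roughly six for the accelerated one.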

$μ^2$-SGD: Stable Stochastic Optimization via a Double Momentum Mechanism

An Improved Analysis of Stochastic Gradient Descent with Momentum

Stochastic Momentum Method with Double Acceleration for Regularized Empirical Risk Minimization

Demystifying SGD with Doubly Stochastic Gradients

The Optimality of (Accelerated) SGD for High-Dimensional Quadratic Optimization

Exponential convergence rates for momentum stochastic gradient descent in the overparametrized setting

Accelerated stochastic approximation with state-dependent noise

The Anytime Convergence of Stochastic Gradient Descent with Momentum: From a Continuous-Time Perspective

Improved Learning Rates for Stochastic Optimization: Two Theoretical Viewpoints

Optimal Adaptive and Accelerated Stochastic Gradient Descent

Towards Noise-adaptive, Problem-adaptive (Accelerated) Stochastic Gradient Descent

A Diffusion Approximation Theory of Momentum SGD in Nonconvex Optimization

Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality

Convergence of SGD with momentum in the nonconvex case: A time window-based analysis

Improving Stochastic Cubic Newton with Momentum

Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization

mPage: Probabilistic Gradient Estimator With Momentum for Non-Convex Optimization

Langevin Dynamics: A Unified Perspective on Optimization via Lyapunov Potentials

Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight

High Probability Guarantees for Nonconvex Stochastic Gradient Descent with Heavy Tails.

Stochastic Methods in Variational Inequalities: Ergodicity, Bias and Refinements