Abstract: Training a modern machine learning architecture on a new task requires extensive learning-rate tuning, which comes at a high computational cost. Here we develop new Polyak-type adaptive learning rates that can be used on top of any momentum method, and require less tuning to perform well. We first develop MoMo, a Momentum Model based adaptive learning rate for SGD-M (stochastic gradient descent with momentum). MoMo uses momentum estimates of the losses and gradients sampled at each iteration to build a model of the loss function. Our model makes use of any known lower bound of the loss function by using truncation, e.g. most losses are lower-bounded by zero. The model is then approximately minimized at each iteration to compute the next step. We show how MoMo can be used in combination with any momentum-based method, and showcase this by developing MoMo-Adam, which is Adam with our new model-based adaptive learning rate. We show that MoMo attains an $\mathcal{O}(1/\sqrt{K})$ convergence rate for convex problems with interpolation, needing knowledge of no problem-specific quantities other than the optimal value. Additionally, for losses with unknown lower bounds, we develop on-the-fly estimates of a lower bound that are incorporated into our model. We show that MoMo and MoMo-Adam improve over SGD-M and Adam in terms of robustness to hyperparameter tuning for training image classifiers on MNIST, CIFAR, and ImageNet, for recommender systems on Criteo, for a transformer model on the translation task IWSLT14, and for a diffusion model.

What problem does this paper attempt to address?

The paper addresses the high computational cost of learning-rate tuning when training modern machine learning models. Specifically:

1. **High cost of learning-rate tuning**: Training a modern large-scale neural network may require more than \$1 million in computational resources. Finding an appropriate learning rate usually takes multiple training runs, which adds further to that cost.
2. **Limitations of existing methods**: Existing optimization algorithms (such as SGD with momentum and Adam) are very sensitive to the learning rate, and the best value differs from task to task. Reaching good performance therefore requires extensive hyperparameter tuning, which is both time-consuming and computationally expensive.

### Solutions proposed in the paper

To address these problems, the paper proposes MoMo (Momentum Models for Adaptive Learning Rates), an adaptive learning-rate method based on momentum models. Its main features are:

- **Adaptive learning rate**: MoMo introduces a new Polyak-type adaptive learning rate that can be combined with any momentum method, reducing the need for manual learning-rate tuning.
- **Model construction**: At each iteration, MoMo uses momentum estimates of the sampled losses and gradients to build a model of the loss function. Truncation ensures that the model never falls below a known lower bound of the loss (for example, zero).
- **Wide application**: The paper applies MoMo to a variety of tasks, including image classification (MNIST, CIFAR, ImageNet), recommender systems (Criteo), translation (IWSLT14), and a diffusion model.

### Main contributions

- **Theoretical analysis**: The paper proves that MoMo attains an $O(1/\sqrt{K})$ convergence rate for convex problems with interpolation, and that it needs no problem-specific quantities other than the optimal value.
- **Robustness improvement**: Experiments show that MoMo and its variants (such as MoMo-Adam) are more robust to the choice of learning rate than standard SGD-M and Adam across tasks. The adaptive step size also produces learning-rate warm-up and decay automatically, which improves the stability and performance of training.

### Formula presentation

The core update formulas of MoMo are as follows:

\[ \tau_k := \min \left\{ \frac{\alpha_k}{\rho_k}, \frac{\left( \bar{f}_k + \langle d_k, x_k \rangle - \gamma_k - \rho_k f_k^* \right)^+}{\|d_k\|^2} \right\} \]

\[ x_{k+1} = x_k - \tau_k d_k \]

where:

- $\alpha_k$ is the user-specified learning rate,
- $\rho_k$ is the sum of the weighting coefficients,
- $\bar{f}_k$ is the weighted average of past loss values,
- $d_k$ is the weighted average of past gradients,
- $\gamma_k$ is the weighted average of the inner products between past gradients and the corresponding parameters,
- $f_k^*$ is the lower-bound estimate of the loss function,
- $(z)^+ = \max\{z, 0\}$ denotes the positive part, which implements the truncation at the lower bound.

The step size $\tau_k$ is thus capped by the user-specified learning rate and shrinks automatically as the model value approaches the lower bound, which reduces the need for manual tuning.
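The update above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `momo_step`, the `state` dictionary, and the parameters `lr`, `beta`, `lower_bound`, and `rho` are hypothetical, and the weighted averages are assumed to be exponential moving averages with coefficient `beta`, so that the weights sum to `rho = 1`.

```python
import numpy as np


def momo_step(x, loss, grad, state, lr=1.0, beta=0.9, lower_bound=0.0, rho=1.0):
    """One MoMo-style step following the formulas above (illustrative sketch).

    Assumption: the weighted averages are exponential moving averages with
    coefficient `beta`, initialised with the first sample, so rho = 1.
    """
    if not state:
        # First call: initialise the averages with the current sample.
        state["f_bar"] = float(loss)
        state["d"] = grad.copy()
        state["gamma"] = float(grad @ x)
    else:
        state["f_bar"] = beta * state["f_bar"] + (1 - beta) * float(loss)
        state["d"] = beta * state["d"] + (1 - beta) * grad
        state["gamma"] = beta * state["gamma"] + (1 - beta) * float(grad @ x)

    d = state["d"]
    d_norm_sq = float(d @ d)
    if d_norm_sq == 0.0:
        # Zero averaged gradient: no direction to move in.
        return x.copy(), 0.0

    # Numerator of the adaptive term before truncation.
    h = state["f_bar"] + float(d @ x) - state["gamma"] - rho * lower_bound
    # Truncate at zero and cap by the user-specified learning rate.
    tau = min(lr / rho, max(h, 0.0) / d_norm_sq)
    return x - tau * d, tau


# Toy usage on f(x) = 0.5 * ||x||^2, whose minimum value is 0.
x = np.array([3.0, 4.0])
state = {}
for _ in range(50):
    loss = 0.5 * float(x @ x)
    grad = x.copy()
    x, tau = momo_step(x, loss, grad, state, lr=10.0)
print("final loss:", 0.5 * float(x @ x))
```

On the first call the averages coincide with the current sample, so the adaptive term reduces to the classical Polyak step $(f(x_k) - f_k^*)/\|\nabla f(x_k)\|^2$. The cap `lr / rho` only becomes active when the user-specified learning rate is smaller than that value.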

MoMo: Momentum Models for Adaptive Learning Rates
