Abstract: Optimization algorithms with momentum are widely used to train deep learning models because of their fast convergence. Momentum accelerates stochastic gradient descent along relevant directions and damps oscillations in the parameter update path. In these algorithms the gradient at each step is computed from only a subset of the training samples, so it is stochastic and may introduce errors into the parameter updates. In that case, carrying the influence of the previous step into the current step with a fixed weight is inaccurate: it propagates the error and hinders correction in the current step. Such a hyperparameter can also be extremely hard to tune in practice. In this paper, we introduce a novel optimization algorithm, Discriminative wEight on Adaptive Momentum (DEAM). Instead of setting the momentum weight with a fixed hyperparameter, DEAM computes it automatically from the discriminative angle, so that the weight takes a value appropriate to the momentum in the current step. As a result, DEAM involves fewer hyperparameters. DEAM also contains a novel backtrack term, which restricts redundant updates when the previous step needs to be corrected. The backtrack term effectively adapts the learning rate and also achieves an anticipatory update. Extensive experiments demonstrate that DEAM converges faster than existing optimization algorithms when training deep learning models in both convex and non-convex settings.
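To make the idea of an angle-dependent momentum weight concrete, the following is a minimal sketch of one momentum step whose weight depends on the angle between the previous momentum and the current gradient. It is not the DEAM update rule: the abstract does not give that formula, so the cosine-based weight, the `cap` parameter, and the function names `angle`, `momentum_weight`, and `step` are all assumptions chosen for illustration. The backtrack term is not modeled.

```python
import math


def angle(u, v):
    """Angle in radians between vectors u and v (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_theta)


def momentum_weight(prev_momentum, grad, cap=0.9):
    """Hypothetical rule (an assumption, NOT the DEAM formula).

    The weight equals `cap` when the previous momentum and the gradient
    are aligned, and falls to 0 once they are 90 degrees or more apart.
    """
    theta = angle(prev_momentum, grad)
    return cap * max(0.0, math.cos(theta))


def step(params, prev_momentum, grad, lr=0.1):
    """One heavy-ball-style update using the angle-dependent weight."""
    w = momentum_weight(prev_momentum, grad)
    momentum = [w * m + g for m, g in zip(prev_momentum, grad)]
    new_params = [p - lr * m for p, m in zip(params, momentum)]
    return new_params, momentum, w
```

Under this illustrative rule, a gradient that points against the previous momentum receives a weight of zero, so the previous (possibly erroneous) direction is not carried forward. This mirrors the motivation stated in the abstract, though only loosely.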

Momentum is All You Need for Data-Driven Adaptive Optimization

ADINE: An Adaptive Momentum Method for Stochastic Gradient Descent

Training Deep Neural Networks with Adaptive Momentum Inspired by the Quadratic Optimization

Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

A new non-adaptive optimization method: Stochastic gradient descent with momentum and difference

Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style Adaptive Momentum

Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization

Stochastic Momentum Method with Double Acceleration for Regularized Empirical Risk Minimization

A fast adaptive algorithm for training deep neural networks

DEAM: Adaptive Momentum with Discriminative Weight for Stochastic Optimization

A Unified Analysis of AdaGrad With Weighted Aggregation and Momentum Acceleration

Bidirectional Looking with A Novel Double Exponential Moving Average to Adaptive and Non-adaptive Momentum Optimizers

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

A modification of adaptive moment estimation (adam) for machine learning

AdaXod: a new adaptive and momental bound algorithm for training deep neural networks

Adaptive momentum with discriminative weight for neural network stochastic optimization

ACMo: Angle-Calibrated Moment Methods for Stochastic Optimization

AMAdam: adaptive modifier of Adam method

AdaDB: an Adaptive Gradient Method with Data-Dependent Bound

Optimal Adaptive and Accelerated Stochastic Gradient Descent

The AdEMAMix Optimizer: Better, Faster, Older