Abstract: Gradient descent-based optimization methods underpin the parameter training of neural networks, and hence are a significant component of the impressive test results found in a number of applications. Introducing stochasticity is key to their success in practical problems, and there is some understanding of the role of stochastic gradient descent in this context. Momentum modifications of gradient descent, such as Polyak's Heavy Ball method (HB) and Nesterov's method of accelerated gradients (NAG), are also widely adopted. In this work our focus is on understanding the role of momentum in the training of neural networks, concentrating on the common situation in which the momentum contribution is fixed at each step of the algorithm. To expose the ideas simply, we work in the deterministic setting. Our approach is to derive continuous-time approximations of the discrete algorithms; these continuous-time approximations provide insights into the mechanisms at play within the discrete algorithms. We prove three such approximations. Firstly, we show that standard implementations of fixed momentum methods approximate a time-rescaled gradient descent flow, asymptotically as the learning rate shrinks to zero; in the limit of vanishing learning rate, this result does not distinguish momentum methods from pure gradient descent. We then proceed to prove two results aimed at understanding the observed practical advantages of fixed momentum methods over gradient descent. We achieve this by proving approximations to continuous-time limits in which the small but fixed learning rate appears as a parameter. Furthermore, in a third result, we show that the momentum methods admit an exponentially attractive invariant manifold on which the dynamics reduce, approximately, to a gradient flow with respect to a modified loss function.
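The first result can be illustrated numerically with a minimal sketch. The abstract does not fix an update rule or notation, so everything here is an assumption: HB is taken in one common form, v_{n+1} = lam * v_n - h * grad(u_n), u_{n+1} = u_n + v_{n+1}, with learning rate h and fixed momentum lam; the time-rescaled gradient flow is taken to be du/dt = -grad f(u) / (1 - lam), the rescaling usually associated with this form; and the test problem is the hypothetical one-dimensional quadratic f(u) = a * u^2 / 2, for which that flow has a closed-form solution. The sketch compares the HB iterate at time t = n * h with the flow solution for a few learning rates.

```python
import math


def heavy_ball(grad, u0, h, lam, n_steps):
    """Run fixed-momentum Heavy Ball (assumed form) and return the final iterate."""
    u, v = u0, 0.0  # start with zero velocity
    for _ in range(n_steps):
        v = lam * v - h * grad(u)  # momentum update with fixed lam
        u = u + v                  # parameter update
    return u


def rescaled_flow_quadratic(u0, a, lam, t):
    """Exact solution of du/dt = -a * u / (1 - lam), the assumed rescaled flow."""
    return u0 * math.exp(-a * t / (1.0 - lam))


def flow_error(h, lam=0.5, a=1.0, u0=1.0, t_final=1.0):
    """Absolute gap between HB at time t_final = n * h and the rescaled flow."""
    n_steps = round(t_final / h)
    u_hb = heavy_ball(lambda u: a * u, u0, h, lam, n_steps)
    return abs(u_hb - rescaled_flow_quadratic(u0, a, lam, t_final))


if __name__ == "__main__":
    # Illustrative learning rates; the gap should shrink as h decreases.
    for h in (0.02, 0.01, 0.005, 0.001):
        print(f"h={h:<6} |HB - flow| = {flow_error(h):.6f}")
```

On this toy problem the gap shrinks as h decreases, which is consistent with the statement that, in the vanishing learning rate limit, fixed momentum behaves like gradient descent with a rescaled time variable. It says nothing about the finite-h corrections addressed by the other two results.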

On the fast convergence of minibatch heavy ball momentum

(Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum

Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise

An Asymptotic Analysis of Random Partition Based Minibatch Momentum Methods for Linear Regression Models

Normalized Stochastic Heavy Ball with Adaptive Momentum

Stochastic Momentum Method with Double Acceleration for Regularized Empirical Risk Minimization

Exploring the Inefficiency of Heavy Ball As Momentum Parameter Approaches 1

Minibatch and Momentum Model-based Methods for Stochastic Weakly Convex Optimization

Training Deep Neural Networks with Adaptive Momentum Inspired by the Quadratic Optimization

Continuous Time Analysis of Momentum Methods

Accelerated Stochastic Min-Max Optimization Based on Bias-corrected Momentum

Unified Convergence Analysis of Stochastic Momentum Methods for Convex and Non-convex Optimization

Accelerated Learning for Restricted Boltzmann Machine with a Novel Momentum Algorithm

On the Convergence Analysis of Aggregated Heavy-Ball Method

Nonsmooth Nonconvex Stochastic Heavy Ball

Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality

Accelerated Over-Relaxation Heavy-Ball Methods with Provable Acceleration and Global Convergence

Research on RBM Accelerating Learning Algorithm with Weight Momentum

On adaptive stochastic heavy ball momentum for solving linear systems

Improving Stochastic Cubic Newton with Momentum