Abstract: Gradient descent-based optimization methods underpin the parameter training of neural networks, and hence comprise a significant component in the impressive test results found in a number of applications. Introducing stochasticity is key to their success in practical problems, and there is some understanding of the role of stochastic gradient descent in this context. Momentum modifications of gradient descent, such as Polyak's Heavy Ball method (HB) and Nesterov's method of accelerated gradients (NAG), are also widely adopted. In this work our focus is on understanding the role of momentum in the training of neural networks, concentrating on the common situation in which the momentum contribution is fixed at each step of the algorithm. To expose the ideas simply we work in the deterministic setting. Our approach is to derive continuous time approximations of the discrete algorithms; these continuous time approximations provide insights into the mechanisms at play within the discrete algorithms. We prove three such approximations. Firstly, we show that standard implementations of fixed momentum methods approximate a time-rescaled gradient descent flow, asymptotically as the learning rate shrinks to zero; this result does not distinguish momentum methods from pure gradient descent in the limit of vanishing learning rate. We then proceed to prove two results aimed at understanding the observed practical advantages of fixed momentum methods over gradient descent. We achieve this by proving approximations to continuous time limits in which the small but fixed learning rate appears as a parameter. Furthermore, in a third result, we show that the momentum methods admit an exponentially attractive invariant manifold on which the dynamics reduces, approximately, to a gradient flow with respect to a modified loss function.
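The first result can be illustrated numerically. The sketch below is not taken from the paper: it assumes the common heavy-ball form x_{n+1} = x_n + λ(x_n − x_{n−1}) − h∇f(x_n), a toy loss f(x) = x²/2, and a time rescaling of 1/(1 − λ), a factor the abstract itself does not state. Under those assumptions, heavy ball with learning rate h should track plain gradient descent with learning rate h/(1 − λ), and the largest gap between the two trajectories should shrink as h decreases. The function names and parameter values are hypothetical.

```python
def grad(x):
    # Gradient of the toy loss f(x) = 0.5 * x**2 (an illustrative assumption).
    return x


def heavy_ball(x0, h, lam, steps):
    # Assumed heavy-ball form: x_{n+1} = x_n + lam * (x_n - x_{n-1}) - h * grad(x_n),
    # started with zero initial velocity (x_{-1} = x_0).
    xs = [x0]
    x_prev, x = x0, x0
    for _ in range(steps):
        x_prev, x = x, x + lam * (x - x_prev) - h * grad(x)
        xs.append(x)
    return xs


def gradient_descent(x0, h, steps):
    # Plain gradient descent: x_{n+1} = x_n - h * grad(x_n).
    xs = [x0]
    x = x0
    for _ in range(steps):
        x = x - h * grad(x)
        xs.append(x)
    return xs


def max_gap(h, lam=0.5, x0=1.0, t_end=5.0):
    # Largest distance between heavy ball (learning rate h) and gradient
    # descent with the rescaled learning rate h / (1 - lam), over the same
    # number of steps covering the time horizon t_end.
    steps = round(t_end / h)
    hb = heavy_ball(x0, h, lam, steps)
    gd = gradient_descent(x0, h / (1.0 - lam), steps)
    return max(abs(a - b) for a, b in zip(hb, gd))


if __name__ == "__main__":
    for h in (1e-2, 1e-3):
        print(f"h = {h:g}: max gap = {max_gap(h):.6f}")
```

On this toy problem the gap comes mostly from the first few steps, while the momentum term is still building up; after that the two trajectories decay at nearly the same rate. This is consistent with the abstract's remark that the vanishing-learning-rate limit does not distinguish momentum methods from pure gradient descent.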

Convergence of the Iterates for Momentum and RMSProp for Local Smooth Functions: Adaptation is the Key

Random Reshuffling with Momentum for Nonconvex Problems: Iteration Complexity and Last Iterate Convergence

Local Quadratic Convergence of Stochastic Gradient Descent with Adaptive Step Size

On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm

Last-iterate convergence analysis of stochastic momentum methods for neural networks

Convergence of the momentum method for semialgebraic functions with locally Lipschitz gradients

Continuous Time Analysis of Momentum Methods

Tradeoffs between convergence rate and noise amplification for momentum-based accelerated optimization algorithms

CoolMomentum: A Method for Stochastic Optimization by Langevin Dynamics with Simulated Annealing

Convergence and Stability of the Stochastic Proximal Point Algorithm with Momentum

Role of Momentum in Smoothing Objective Function and Generalizability of Deep Neural Networks

Convergence of an online gradient method with inner-product penalty and adaptive momentum

Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight

Implicit Regularization and Momentum Algorithms in Nonlinearly Parameterized Adaptive Control and Prediction

An Abstract Lyapunov Control Optimizer: Local Stabilization and Global Convergence

On Hyper-Parameter Selection for Guaranteed Convergence of RMSProp

Gauss-Newton Dynamics for Neural Networks: A Riemannian Optimization Perspective

Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance

Convergence of Batch Gradient Learning with Smoothing Regularization and Adaptive Momentum for Neural Networks