Abstract: Gradient descent has been a central training principle for artificial neural networks, from their early beginnings to today's deep learning networks. Its most common implementation is the backpropagation algorithm for training feed-forward neural networks in a supervised fashion. Backpropagation computes the gradient of a loss function with respect to the weights of the network and uses it to update the weights and thus minimize the loss. Although the mean squared error is often used as the loss function, the general stochastic gradient descent principle does not by itself prescribe a specific loss function. Another drawback of backpropagation has been the search for optimal values of two important training parameters, the learning rate and the momentum weight, which are determined empirically in most systems. The learning rate specifies the step size towards a minimum of the loss function when following the gradient, while the momentum weight takes previous weight changes into account when updating the current weights. Using both parameters together is generally accepted as a means to improve training, although their specific values do not follow directly from standard backpropagation theory. This paper proposes a new information-theoretical loss function motivated by neural signal processing in a synapse. The new loss function implies a specific learning rate and momentum weight, leading to empirical parameters often used in practice. The proposed framework also provides a more formal explanation of the momentum term and its smoothing effect on the training process. Taken together, the results show that loss, learning rate, and momentum are closely connected. To support these theoretical findings, experiments on handwritten digit recognition show the practical usefulness of the proposed loss function and training parameters.
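To make the roles of the two training parameters concrete, the following is a minimal sketch of gradient descent with a classical momentum term, applied to a mean squared error loss for a one-parameter model y = w * x. It is not the loss function or the parameter values proposed in the paper: the function names, the toy data, and the values `learning_rate=0.01` and `momentum=0.9` are illustrative assumptions only.

```python
def momentum_descent(grad, w0, learning_rate, momentum, steps):
    """Gradient descent with a classical momentum term (illustrative sketch)."""
    w = w0
    delta = 0.0  # previous weight change
    for _ in range(steps):
        # New change = momentum-weighted previous change minus a gradient step.
        delta = momentum * delta - learning_rate * grad(w)
        w = w + delta
    return w


def mse_grad(w, xs=(1.0, 2.0, 3.0), ys=(2.0, 4.0, 6.0)):
    """Gradient of the mean squared error of y = w * x on toy data (assumed)."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Hypothetical parameter values, chosen only so that the toy problem converges.
w_fit = momentum_descent(mse_grad, w0=0.0, learning_rate=0.01,
                         momentum=0.9, steps=500)
print(w_fit)
```

In this sketch the learning rate scales the step along the negative gradient, and the momentum weight controls how much of the previous weight change is carried into the current update, which is the smoothing effect the abstract refers to.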
