Abstract: Stochastic gradient descent (SGD) optimization methods play a vital role in training neural networks and are attracting growing attention in the science and engineering of intelligent systems. The choice of learning rate affects the convergence rate of these methods. Current learning rate adjustment strategies face two main problems: (1) Traditional learning rate decay schedules are set manually over the training iterations, and the small learning rates they produce slow convergence when training neural networks. (2) Adaptive methods (e.g., Adam) generalize poorly. To alleviate these issues, we propose a novel automatic learning rate decay strategy for SGD optimization methods in neural networks. Based on the observation that the upper bound on the convergence rate at a given iteration can be minimized with respect to the current learning rate, we first derive an expression for the current learning rate in terms of the historical learning rates. Only one extra parameter needs to be initialized to generate automatically decreasing learning rates during training. The proposed approach is applied to the SGD and Momentum SGD optimization algorithms, and a theoretical proof establishes its convergence. Numerical simulations are conducted on the MNIST and CIFAR-10 data sets with different neural networks. Experimental results show that our algorithm outperforms existing classical ones, achieving a faster convergence rate, better stability, and better generalization performance in neural network training. It also lays a foundation for large-scale parallel search of initial parameters in intelligent systems.

Scaling transition from momentum stochastic gradient descent to plain stochastic gradient descent

A decreasing scaling transition scheme from Adam to SGD

Optimal Adaptive and Accelerated Stochastic Gradient Descent

Stochastic normalized gradient descent with momentum for large-batch training

Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style Adaptive Momentum

Flatter, faster: scaling momentum for optimal speedup of SGD

Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent

Multi-stage stochastic gradient method with momentum acceleration

Trend-Smooth: Accelerate Asynchronous SGD by Smoothing Parameters Using Parameter Trends

Convergence rates of stochastic gradient method with independent sequences of step-size and momentum weight

Asynchronous Accelerated Stochastic Gradient Descent

Acceleration of stochastic gradient descent with momentum by averaging: finite-sample rates and asymptotic normality

An automatic learning rate decay strategy for stochastic gradient descent optimization methods in neural networks

When and Why Momentum Accelerates SGD: An Empirical Study

Pbsgd: Powered Stochastic Gradient Descent Methods for Accelerated Nonconvex Optimization

Random Scaling and Momentum for Non-smooth Non-convex Optimization

Combining Conjugate Gradient and Momentum for Unconstrained Stochastic Optimization With Applications to Machine Learning

On the Convergence of Memory-Based Distributed SGD

A new non-adaptive optimization method: Stochastic gradient descent with momentum and difference

A Diffusion Approximation Theory of Momentum SGD in Nonconvex Optimization

Towards Noise-adaptive, Problem-adaptive (Accelerated) Stochastic Gradient Descent