Abstract: A deep learning model is a multi-layered network structure, and the network parameters that determine the final performance of the model must be trained by a deep learning optimizer. Mainstream optimizers use integer-order derivatives, which reflect only local information; fractional-order derivative optimizers, which can capture global information, are gradually gaining attention. However, relying solely on the long-term estimated gradients computed from fractional-order derivatives, while disregarding the influence of recent gradients on the optimization process, can lead to issues such as convergence to local optima and slower optimization. In this paper, we design an adaptive learning rate optimizer called AdaGL based on the Grünwald–Letnikov (G–L) fractional-order derivative. It dynamically changes the direction and step size of parameter updates according to both long-term and short-term gradient information, addressing the problem of falling into local minima or saddle points. Specifically, by utilizing the global memory of fractional-order calculus, we replace the gradient in the parameter update with a G–L fractional-order approximated gradient, making better use of long-term curvature information from the past. Furthermore, considering that recent gradient information often has a significant impact on the optimization phase, we propose a step size control coefficient that adjusts the learning rate in real time. To compare the performance of AdaGL with current advanced optimizers, we conduct experiments on several deep learning tasks, including image classification with CNNs, node classification and graph classification with GNNs, image generation with GANs, and language modeling with LSTMs. Extensive experimental results demonstrate that AdaGL achieves stable and fast convergence, excellent accuracy, and good generalization performance.
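To make the notion of "global memory" concrete, the following is a minimal sketch of the standard G–L weighting of a gradient history, assuming a unit step size and a truncated window. The function names `gl_coefficients` and `gl_weighted_gradient` are hypothetical, and the sketch does not reproduce AdaGL's actual update rule or its step size control coefficient, neither of which is specified in the abstract.

```python
def gl_coefficients(alpha, num_terms):
    """Return the first `num_terms` G-L weights c_k = (-1)^k * C(alpha, k).

    Uses the standard recurrence c_0 = 1, c_k = c_{k-1} * (1 - (alpha + 1) / k).
    """
    coeffs = [1.0]
    for k in range(1, num_terms):
        coeffs.append(coeffs[-1] * (1.0 - (alpha + 1.0) / k))
    return coeffs


def gl_weighted_gradient(history, alpha):
    """Combine past gradients with G-L weights.

    `history` lists scalar gradients newest first: [g_t, g_{t-1}, ...].
    Assumption: unit step size and truncation to len(history) terms.
    """
    coeffs = gl_coefficients(alpha, len(history))
    return sum(c * g for c, g in zip(coeffs, history))


if __name__ == "__main__":
    # For alpha = 1 the weights reduce to the first difference [1, -1, 0, ...].
    print(gl_coefficients(1.0, 4))
    # For a fractional order, every past gradient keeps a nonzero weight.
    print(gl_coefficients(0.5, 4))
    print(gl_weighted_gradient([3.0, 1.0], 1.0))
```

With an integer order the weights vanish after a finite number of terms, whereas a fractional order gives every earlier gradient a slowly decaying nonzero weight, which is the long-term memory the abstract refers to.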

An Adaptive Learning Rate Deep Learning Optimizer Using Long and Short-Term Gradients Based on G–L Fractional-Order Derivative
