Abstract: A deep learning model is a multi-layered network structure, and the network parameters that determine the final performance of the model must be trained by a deep learning optimizer. Mainstream optimizers use integer-order derivatives, which reflect only local information; fractional-order derivative optimizers, which can capture global information, are gradually gaining attention. However, relying solely on the long-term gradient estimates computed from fractional-order derivatives, while disregarding the influence of recent gradients on the optimization process, can sometimes lead to convergence to local optima and slower optimization. In this paper, we design an adaptive learning rate optimizer called AdaGL based on the Grünwald–Letnikov (G–L) fractional-order derivative. It dynamically changes the direction and step size of the parameter update according to both long-term and short-term gradient information, addressing the problem of falling into local minima or saddle points. Specifically, by exploiting the global memory of fractional-order calculus, we replace the gradient in the parameter update with a G–L fractional-order approximated gradient, making better use of past long-term curvature information. Furthermore, since recent gradient information often has a significant impact on the optimization phase, we propose a step size control coefficient that adjusts the learning rate in real time. To compare the performance of AdaGL with current advanced optimizers, we evaluate it on several deep learning tasks, including image classification with CNNs, node classification and graph classification with GNNs, image generation with GANs, and language modeling with LSTMs. Extensive experimental results demonstrate that AdaGL achieves stable and fast convergence, excellent accuracy, and good generalization performance.
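The abstract does not give the AdaGL update rule, so the following is only a minimal sketch of the general idea behind a G–L fractional-order approximated gradient, not the authors' method. It uses the standard G–L coefficients c_k = (-1)^k C(α, k), computed by the recurrence c_0 = 1, c_k = c_{k-1}(1 - (α + 1)/k), and applies them as a truncated weighted sum over a history of past gradients. The function names `gl_coefficients` and `gl_weighted_gradient`, the scalar gradients, and the choice to weight the raw gradient history directly are illustrative assumptions; the step size control coefficient mentioned in the abstract is not modeled here.

```python
def gl_coefficients(alpha, num_terms):
    """Return c_0..c_{num_terms-1}, where c_k = (-1)^k * binom(alpha, k).

    Uses the recurrence c_0 = 1, c_k = c_{k-1} * (1 - (alpha + 1) / k).
    """
    coeffs = [1.0]
    for k in range(1, num_terms):
        coeffs.append(coeffs[-1] * (1.0 - (alpha + 1.0) / k))
    return coeffs


def gl_weighted_gradient(history, alpha):
    """Truncated G-L sum over a gradient history (oldest first, newest last).

    Hypothetical helper: c_0 multiplies the newest gradient and c_k the
    gradient from k steps earlier. This is an assumed illustration, not
    the update rule from the paper.
    """
    coeffs = gl_coefficients(alpha, len(history))
    return sum(c * g for c, g in zip(coeffs, reversed(history)))


if __name__ == "__main__":
    # Toy scalar gradient history, oldest first.
    grads = [1.0, 2.0, 4.0]
    print(gl_coefficients(0.5, 3))
    print(gl_weighted_gradient(grads, 0.5))
```

With α = 1 the coefficients reduce to 1, -1, 0, 0, ..., so the sum becomes the ordinary first-order backward difference; for 0 < α < 1 every past gradient keeps a nonzero weight, which is the "global memory" property the abstract refers to.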

ADLER -- An efficient Hessian-based strategy for adaptive learning rate

ADADELTA: An Adaptive Learning Rate Method

Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analyses

Adaptive Step Sizes for Preconditioned Stochastic Gradient Descent

Adaptive Learning Rates with Maximum Variation Averaging

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

An Adaptive Stochastic Gradient Method with Non-negative Gauss-Newton Stepsizes

Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning

Angle based dynamic learning rate for gradient descent

Fast Unconstrained Optimization via Hessian Averaging and Adaptive Gradient Sampling Methods

An adaptive Hessian approximated stochastic gradient MCMC method

An Adaptive and Momental Bound Method for Stochastic Learning

Efficient Adaptive Optimization via Subset-Norm and Subspace-Momentum: Fast, Memory-Reduced Training with Convergence Guarantees

Differentiable Self-Adaptive Learning Rate

A New Adaptive Gradient Method with Gradient Decomposition

The High Line: Exact Risk and Learning Rate Curves of Stochastic Adaptive Learning Rate Algorithms

Non-convergence of Adam and other adaptive stochastic gradient descent optimization methods for non-vanishing learning rates

MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients

An Adaptive Learning Rate Deep Learning Optimizer Using Long and Short-Term Gradients Based on G–L Fractional-Order Derivative

Reducing Adversarial Training Cost with Gradient Approximation

Finite-sum optimization: Adaptivity to smoothness and loopless variance reduction