Abstract: A deep learning model is a multi-layered network whose parameters determine its final performance, and these parameters must be trained by an optimizer. Mainstream optimizers use integer-order derivatives, which reflect only local information; fractional-order derivative optimizers, which can capture global information, are gradually gaining attention. However, relying solely on the long-term gradient estimates computed from fractional-order derivatives, while disregarding the influence of recent gradients on the optimization process, can sometimes lead to convergence to local optima and slower optimization. In this paper, we design an adaptive learning rate optimizer called AdaGL based on the Grünwald–Letnikov (G–L) fractional-order derivative. It dynamically changes the direction and step size of parameter updates according to both long-term and short-term gradient information, addressing the problem of falling into local minima or saddle points. Specifically, by exploiting the global memory of fractional-order calculus, we replace the gradient in the parameter update with a G–L fractional-order approximated gradient, making better use of long-term curvature information from past iterations. Furthermore, since recent gradient information often has a significant impact on the optimization phase, we propose a step size control coefficient that adjusts the learning rate in real time. To compare AdaGL with current advanced optimizers, we evaluate it on several deep learning tasks: image classification with CNNs, node classification and graph classification with GNNs, image generation with GANs, and language modeling with LSTMs. Extensive experimental results demonstrate that AdaGL achieves stable and fast convergence, excellent accuracy, and good generalization performance.
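To make the "global memory" of the G–L derivative concrete, the sketch below computes the standard truncated G–L weights c_k = (-1)^k C(α, k) and applies them to a short history of samples. It is only an illustration of the textbook G–L finite sum, not the AdaGL update rule, which the abstract does not specify. The function names `gl_coefficients` and `gl_fractional_difference`, the step `h`, and the sample values are assumptions made for this example.

```python
def gl_coefficients(alpha, n_terms):
    """Return the G-L weights c_k = (-1)^k * binom(alpha, k), k = 0..n_terms-1.

    Uses the recurrence c_0 = 1, c_k = c_{k-1} * (1 - (alpha + 1) / k).
    """
    coeffs = [1.0]
    for k in range(1, n_terms):
        coeffs.append(coeffs[-1] * (1.0 - (alpha + 1.0) / k))
    return coeffs


def gl_fractional_difference(samples_newest_first, alpha, h=1.0):
    """Truncated G-L approximation of the order-alpha derivative.

    samples_newest_first[k] is the value k steps (of size h) in the past.
    Illustrative only: every stored sample contributes, so older values
    still influence the result (the "long memory" of fractional calculus).
    """
    coeffs = gl_coefficients(alpha, len(samples_newest_first))
    total = sum(c * s for c, s in zip(coeffs, samples_newest_first))
    return total / (h ** alpha)


# Hypothetical history of a scalar quantity, newest value first.
history = [0.4, 0.5, 0.6, 0.8]

half_order = gl_fractional_difference(history, alpha=0.5)
first_order = gl_fractional_difference(history, alpha=1.0)
```

With α = 1 the weights collapse to (1, -1, 0, 0, ...), so only the two most recent samples matter, which is the ordinary backward difference. With a fractional α such as 0.5 every past sample keeps a nonzero weight; this is the sense in which a fractional-order derivative carries long-term information.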

Adaptive Hierarchical Hyper-gradient Descent

Differentiable Self-Adaptive Learning Rate

An Adaptive Mechanism to Achieve Learning Rate Dynamically

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Barzilai-Borwein-based Adaptive Learning Rate for Deep Learning

Interpreting Adaptive Gradient Methods by Parameter Scaling for Learning-Rate-Free Optimization

Hyper-Learning for Gradient-Based Batch Size Adaptation

FedHyper: A Universal and Robust Learning Rate Scheduler for Federated Learning with Hypergradient Descent

Adaptive Gradient Method with Resilience and Momentum

Gradient descent revisited via an adaptive online learning rate

Learning Gradient Descent: Better Generalization and Longer Horizons

Asgd: Stochastic Gradient Descent with Adaptive Batch Size for Every Parameter

Angle based dynamic learning rate for gradient descent

An Adaptive Learning Rate Deep Learning Optimizer Using Long and Short-Term Gradients Based on G–L Fractional-Order Derivative

Adaptive Learning Rates with Maximum Variation Averaging

Gradient Descent: The Ultimate Optimizer

Towards optimal hierarchical training of neural networks

ADADELTA: An Adaptive Learning Rate Method

Gradient Descent Optimization in Deep Learning Model Training Based on Multistage and Method Combination Strategy

LLR: Learning Learning Rates by LSTM for Training Neural Networks

A Zeroth-Order Adaptive Learning Rate Method to Reduce Cost of Hyperparameter Tuning for Deep Learning