Abstract: Stochastic Gradient Descent (SGD) is one of the core techniques behind the success of deep neural networks. The gradient provides information on the direction in which a function has the steepest rate of change. The main problem with basic SGD is that it updates all parameters with equal-sized steps, irrespective of gradient behavior. Hence, an efficient way to optimize deep networks is to use an adaptive step size for each parameter. Recently, several attempts have been made to improve gradient descent methods, such as AdaGrad, AdaDelta, RMSProp and Adam. These methods rely on the square roots of exponential moving averages of squared past gradients and thus do not take advantage of local changes in the gradients. In this paper, a novel optimizer, diffGrad, is proposed based on the difference between the present and the immediate past gradient. In diffGrad, the step size is adjusted for each parameter so that parameters whose gradients change faster take larger steps and parameters whose gradients change more slowly take smaller steps. The convergence analysis is done using the regret bound approach of the online learning framework. A rigorous analysis is performed on three synthetic complex non-convex functions. Image categorization experiments are also conducted on the CIFAR10 and CIFAR100 datasets to compare diffGrad with state-of-the-art optimizers such as SGDM, AdaGrad, AdaDelta, RMSProp, AMSGrad, and Adam. A residual unit (ResNet) based Convolutional Neural Network (CNN) architecture is used in the experiments. The experiments show that diffGrad outperforms the other optimizers. We also show that diffGrad performs uniformly well for training CNNs with different activation functions.
The source code is made publicly available at https://github.com/shivram1987/diffGrad.
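
The abstract describes the idea but not the exact update rule, so the following is only a minimal sketch of what a diffGrad-style step might look like. It assumes an Adam-style update whose step is scaled by a per-parameter factor `xi`, taken here to be a sigmoid of the absolute difference between the previous and current gradient. That choice of scaling function, the hyperparameter defaults, and the names `init_state` and `diffgrad_step` are illustrative assumptions, not taken from the text above.

```python
import math


def init_state(n):
    """Optimizer state for n scalar parameters (hypothetical helper)."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n, "prev_grad": [0.0] * n}


def diffgrad_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One diffGrad-style update over a flat list of parameters (sketch)."""
    state["t"] += 1
    t = state["t"]
    new_theta = []
    for i, (p, g) in enumerate(zip(theta, grad)):
        # Adam-style exponential moving averages of the gradient and its square.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        # Assumed scaling factor: sigmoid of |previous gradient - current gradient|.
        # It lies in [0.5, 1): larger when the gradient changes quickly,
        # 0.5 when the gradient is unchanged.
        xi = 1.0 / (1.0 + math.exp(-abs(state["prev_grad"][i] - g)))
        new_theta.append(p - lr * xi * m_hat / (math.sqrt(v_hat) + eps))
        state["prev_grad"][i] = g
    return new_theta


# Usage: a few steps on f(x) = x^2, whose gradient is 2x.
x = [1.0]
state = init_state(len(x))
for _ in range(5):
    x = diffgrad_step(x, [2.0 * x[0]], state)
```

Under these assumptions, a parameter whose gradient has stopped changing takes roughly half the step that plain Adam would take, while a parameter whose gradient is changing rapidly takes close to the full Adam step.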

Gradient Descent Using Stochastic Circuits for Efficient Training of Learning Machines

Learnable Surrogate Gradient for Direct Training Spiking Neural Networks

Gradient Correction Beyond Gradient Descent

Asynchronous Accelerated Stochastic Gradient Descent

FastSGD: A Fast Compressed SGD Framework for Distributed Machine Learning

Sparse Gradient Compression for Distributed SGD

SSGD: A Safe and Efficient Method of Gradient Descent

Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization

Multiplexed gradient descent: Fast online training of modern datasets on hardware neural networks without backpropagation

Fractional-order spike-timing-dependent gradient descent for multi-layer spiking neural networks

Online Learning for DNN Training: A Stochastic Block Adaptive Gradient Algorithm

"Oddball SGD": Novelty Driven Stochastic Gradient Descent for Training Deep Neural Networks

DC-S3GD: Delay-Compensated Stale-Synchronous SGD for Large-Scale Decentralized Neural Network Training

Asynchronous Stochastic Gradient Descent with Delay Compensation

An Energy-Efficient and Noise-Tolerant Recurrent Neural Network Using Stochastic Computing

Gradient Descent for Spiking Neural Networks

diffGrad: An Optimization Method for Convolutional Neural Networks

Gradient Decomposition Methods for Training Neural Networks with Non-ideal Synaptic Devices

An asynchronous distributed training algorithm based on Gossip communication and Stochastic Gradient Descent

An automatic learning rate decay strategy for stochastic gradient descent optimization methods in neural networks

Communication-Censored Distributed Stochastic Gradient Descent