Abstract: While Standard gradient descent is a very popular optimisation method, its convergence cannot be proven beyond the class of functions whose gradient is globally Lipschitz continuous. As such, its convergence theory does not actually apply to realistic applications such as Deep Neural Networks. In this paper, we prove that its backtracking variant behaves very nicely; in particular, convergence can be shown for all Morse functions. The main theoretical result of this paper is as follows. Theorem. Let $f:\mathbb{R}^k\rightarrow \mathbb{R}$ be a $C^1$ function, and $\{z_n\}$ a sequence constructed by the Backtracking gradient descent algorithm. (1) Either $\lim _{n\rightarrow\infty}||z_n||=\infty$ or $\lim _{n\rightarrow\infty}||z_{n+1}-z_n||=0$. (2) Assume that $f$ has at most countably many critical points. Then either $\lim _{n\rightarrow\infty}||z_n||=\infty$ or $\{z_n\}$ converges to a critical point of $f$. (3) More generally, assume that all connected components of the set of critical points of $f$ are compact. Then either $\lim _{n\rightarrow\infty}||z_n||=\infty$ or $\{z_n\}$ is bounded. Moreover, in the latter case the set of cluster points of $\{z_n\}$ is connected. Some generalised versions of this result, including an inexact version, are included. Another result in this paper concerns the problem of saddle points. We then present a heuristic argument to explain why the Standard gradient descent method works so well, together with modifications of the backtracking versions of gradient descent (GD), momentum (MMT) and Nesterov accelerated gradient (NAG). Experiments with the datasets CIFAR10 and CIFAR100 on various popular architectures verify the heuristic argument in the mini-batch setting as well, and show that our new algorithms, while automatically fine-tuning learning rates, perform better than current state-of-the-art methods such as MMT, NAG, Adagrad, Adadelta, RMSProp, Adam and Adamax.
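The abstract does not spell out the backtracking rule, so the following is only a minimal sketch under stated assumptions: it uses the textbook Armijo sufficient-decrease condition $f(z-\delta \nabla f(z)) - f(z) \le -\alpha\,\delta\,||\nabla f(z)||^2$, shrinking $\delta$ by a factor $\beta$ until the condition holds. The function name `backtracking_gd`, the parameters `delta0`, `alpha`, `beta`, and the test function $(x^2-1)^2+y^2$ (a Morse function whose gradient is not globally Lipschitz) are all illustrative choices, not taken from the paper.

```python
def backtracking_gd(f, grad, z0, delta0=1.0, alpha=0.5, beta=0.5,
                    tol=1e-8, max_iter=10_000):
    """Gradient descent with Armijo backtracking (illustrative sketch).

    delta0, alpha, beta are hypothetical default values, not the paper's.
    """
    z = list(z0)
    for _ in range(max_iter):
        g = grad(z)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq ** 0.5 < tol:
            break  # gradient is numerically zero: near a critical point
        fz = f(z)
        delta = delta0
        z_new = [zi - delta * gi for zi, gi in zip(z, g)]
        # Shrink the step until Armijo's sufficient-decrease condition holds.
        while f(z_new) - fz > -alpha * delta * g_norm_sq:
            delta *= beta
            z_new = [zi - delta * gi for zi, gi in zip(z, g)]
        z = z_new
    return z


# Assumed example: critical points are the minima (1, 0), (-1, 0)
# and the saddle (0, 0), all non-degenerate.
def f(z):
    x, y = z
    return (x * x - 1.0) ** 2 + y * y


def grad_f(z):
    x, y = z
    return [4.0 * x * (x * x - 1.0), 2.0 * y]


z_star = backtracking_gd(f, grad_f, [2.0, 1.0])
```

Because every accepted step decreases $f$ and this $f$ has bounded sublevel sets, the iterates stay bounded, which is the situation in part (2) of the theorem where the sequence converges to a critical point. A fixed step size of the same initial magnitude would not give this guarantee, since the first trial step from the starting point overshoots badly.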

Computation of Generalized Derivatives for Abs-Smooth Functions by Backward Mode Algorithmic Differentiation and Implications to Deep Learning

Understanding Automatic Differentiation Pitfalls

DiCE: The Infinitely Differentiable Monte-Carlo Estimator

GD doesn't make the cut: Three ways that non-differentiability affects neural network training

On the computation of the gradient in implicit neural networks

Backtracking gradient descent method for general $C^1$ functions, with applications to Deep Learning

Automatic differentiation in machine learning: a survey

On the Correctness of Automatic Differentiation for Neural Networks with Machine-Representable Parameters

BackPACK: Packing more into backprop

Tricks from Deep Learning

Generalizing Stochastic Smoothing for Differentiation and Gradient Estimation

The Implicit Regularization for Adaptive Optimization Algorithms on Homogeneous Neural Networks

Convergence of Constant Step Stochastic Gradient Descent for Non-Smooth Non-Convex Functions

Efficient differentiable quadratic programming layers: an ADMM approach

On Training Implicit Models

Reverse Differentiation via Predictive Coding

Fixed-Point Automatic Differentiation of Forward--Backward Splitting Algorithms for Partly Smooth Functions

A backward differential deep learning-based algorithm for solving high-dimensional nonlinear backward stochastic differential equations

Zero Coordinate Shift: Whetted Automatic Differentiation for Physics-informed Operator Learning