Abstract: The deep learning recipe of casting real-world problems as mathematical optimisation, and solving them by training deep neural networks with gradient-based methods, has undoubtedly proven fruitful. The understanding of why deep learning works, however, has lagged behind its practical significance. We aim to make steps towards an improved understanding of deep learning, with a focus on optimisation and model regularisation. We start by investigating gradient descent (GD), a discrete-time algorithm at the basis of most popular deep learning optimisation algorithms. Understanding the dynamics of GD has been hindered by the presence of discretisation drift, the numerical integration error between GD and its often-studied continuous-time counterpart, the negative gradient flow (NGF). To add to the toolkit available to study GD, we derive novel continuous-time flows that account for discretisation drift. Unlike the NGF, these new flows can be used to describe learning-rate-specific behaviours of GD, such as training instabilities observed in supervised learning and two-player games. We then translate insights from continuous time into mitigation strategies for unstable GD dynamics, by constructing novel learning rate schedules and regularisers that do not require additional hyperparameters. Like optimisation, smoothness regularisation is another pillar of deep learning's success, with wide use in supervised learning and generative modelling. Despite their individual significance, the interactions between smoothness regularisation and optimisation have yet to be explored. We find that smoothness regularisation affects optimisation across multiple deep learning domains, and that incorporating smoothness regularisation in reinforcement learning leads to a performance boost that can be recovered using adaptations to optimisation methods.
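As a rough illustration of discretisation drift, the sketch below compares GD with the exact NGF solution on a one-dimensional quadratic. The loss f(theta) = 0.5 * lam * theta**2, the names `theta0`, `lam` and `h`, and the chosen values are assumptions made for this example only and are not taken from the work itself. On this toy loss the NGF solution theta0 * exp(-lam * t) always decays, whereas GD multiplies the iterate by (1 - h * lam) at every step and therefore diverges once h * lam exceeds 2, a learning-rate-specific instability that the NGF cannot capture.

```python
import math


def gd_trajectory(theta0, lam, h, steps):
    """Gradient descent iterates on f(theta) = 0.5 * lam * theta**2 (toy loss, assumed)."""
    thetas = [theta0]
    for _ in range(steps):
        grad = lam * thetas[-1]           # gradient of the quadratic loss
        thetas.append(thetas[-1] - h * grad)
    return thetas


def ngf_solution(theta0, lam, t):
    """Exact solution of the negative gradient flow d(theta)/dt = -lam * theta."""
    return theta0 * math.exp(-lam * t)


def drift(theta0, lam, h, steps):
    """Absolute gap between GD after k steps and the NGF at time k * h."""
    gd = gd_trajectory(theta0, lam, h, steps)
    return [abs(gd[k] - ngf_solution(theta0, lam, k * h)) for k in range(steps + 1)]


# Small learning rate: GD stays close to the flow.
small = drift(theta0=1.0, lam=1.0, h=0.1, steps=50)
# Large learning rate (h * lam > 2): GD oscillates and diverges, the flow still decays.
large = drift(theta0=1.0, lam=1.0, h=2.5, steps=50)

print(f"max drift, h=0.1: {max(small):.4f}")
print(f"final drift, h=2.5: {large[-1]:.3e}")
```

The quadratic is used only because both trajectories are available in closed form; it says nothing about how the flows derived in the work are constructed.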

Improved Performance of Stochastic Gradients with Gaussian Smoothing

Gaussian smoothing gradient descent for minimizing functions (GSmoothGD)

Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization

Gaussian Loss Smoothing Enables Certified Training with Tight Convex Relaxations

Stochastic Gradient Descent in the Viewpoint of Graduated Optimization

Laplacian Smoothing Gradient Descent

Diagonalisation SGD: Fast & Convergent SGD for Non-Differentiable Models via Reparameterisation and Smoothing

Generalizing Stochastic Smoothing for Differentiation and Gradient Estimation

Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization with Optimal Noise Scheduling

Adaptive Gradient Regularization: A Faster and Generalizable Optimization Technique for Deep Neural Networks

Stochastic Gradient Variance Reduction by Solving a Filtering Problem

Trend-Smooth: Accelerate Asynchronous SGD by Smoothing Parameters Using Parameter Trends

Predictive Local Smoothness for Stochastic Gradient Methods

On discretisation drift and smoothness regularisation in neural network training

Improved Analysis of Clipping Algorithms for Non-convex Optimization

Convergence Analysis of Adaptive Gradient Methods under Refined Smoothness and Noise Assumptions

AdaGrad under Anisotropic Smoothness

Improving Discrete Optimisation Via Decoupled Straight-Through Gumbel-Softmax

Role of Momentum in Smoothing Objective Function and Generalizability of Deep Neural Networks

Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition

Stochastic Average Gradient : A Simple Empirical Investigation