Abstract: We present a novel algorithm for training deep neural networks in supervised (classification and regression) and unsupervised (reinforcement learning) scenarios. The algorithm combines standard stochastic gradient descent with gradient clipping: the output layer is updated using clipped gradients, while the rest of the neural network is updated using standard gradients. Updating the output layer using clipped gradients stabilizes it. We show that the remaining layers are automatically stabilized, provided the neural network is composed only of squashing (compact range) activations. We also present a novel squashing activation function, obtained by modifying a Gaussian Error Linear Unit (GELU) to have compact range, which we call Truncated GELU (tGELU). Unlike other squashing activations, such as sigmoid, the range of tGELU can be explicitly specified. As a consequence, the problem of vanishing gradients that arises due to a small range, e.g., in the case of a sigmoid activation, is eliminated. We prove that a neural network composed of squashing activations (tGELU, sigmoid, etc.), when updated using the algorithm presented herein, is numerically stable and has consistent performance (low variance). The theory is supported by extensive experiments. Within reinforcement learning, as a consequence of our study, we show that target networks in Deep Q-Learning can be omitted, greatly speeding up learning and alleviating memory requirements. Cross-entropy based classification algorithms that suffer from high variance are more consistent when trained using our framework. One symptom of numerical instability in training is high variance of the neural network update values. We show, in theory and through experiments, that our algorithm's updates have low variance and that the training loss decreases smoothly.
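The update rule described in the abstract (clipped gradients for the output layer, standard gradients elsewhere) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the gradient is clipped, so clipping by L2 norm is an assumption here, and the names `clip_by_norm`, `hybrid_sgd_step`, `max_norm`, and the layer labels are hypothetical.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale a flat gradient so its L2 norm is at most max_norm (assumed clipping rule)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def hybrid_sgd_step(params, grads, lr, max_norm, output_layer="output"):
    """One SGD step: clipped gradient for the output layer, raw gradient for all others.

    params and grads map a (hypothetical) layer name to a flat list of floats.
    """
    new_params = {}
    for name, weights in params.items():
        g = grads[name]
        if name == output_layer:
            # Only the output layer's gradient is clipped.
            g = clip_by_norm(g, max_norm)
        new_params[name] = [w - lr * gi for w, gi in zip(weights, g)]
    return new_params


# Toy example: two "layers" with deliberately large gradients.
params = {"hidden": [0.5, -0.5], "output": [1.0, 2.0]}
grads = {"hidden": [10.0, 0.0], "output": [30.0, 40.0]}
updated = hybrid_sgd_step(params, grads, lr=0.1, max_norm=1.0)
```

In this toy run the hidden layer takes the full step, while the output layer's gradient (norm 50) is rescaled to norm 1 before the step is applied. The sketch covers only the update rule; it does not model the tGELU activation, whose exact definition is not given in the abstract.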

Regularized Gradient Clipping Provably Trains Wide and Deep Neural Networks

UniGrad-FS: Unified Gradient Projection with Flatter Sharpness for Continual Learning

Improved analysis of clipping algorithms for non-convex optimization

Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity

To Clip or not to Clip: the Dynamics of SGD with Gradient Clipping in High-Dimensions

Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition

Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise

High Probability Analysis for Non-Convex Stochastic Optimization with Clipping

From Gradient Clipping to Normalization for Heavy Tailed SGD

Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees

Gradient Clipping Improves AdaGrad when the Noise Is Heavy-Tailed

A Framework for Provably Stable and Consistent Training of Deep Feedforward Networks

Adaptive Gradient Regularization: A Faster and Generalizable Optimization Technique for Deep Neural Networks

Improved Convergence in High Probability of Clipped Gradient Methods with Heavy Tails

Neural Gradient Regularizer

Batch Clipping and Adaptive Layerwise Clipping for Differential Private Stochastic Gradient Descent

DP-SGD with weight clipping

Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods

Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning