Abstract: Gradient clipping is commonly used in training deep neural networks partly due to its practicability in relieving the exploding gradient problem. Recently, \citet{zhang2019gradient} show that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD via introducing a new assumption called $(L_0, L_1)$-smoothness, which characterizes the violent fluctuation of gradients typically encountered in deep neural networks. However, their iteration complexities on the problem-dependent parameters are rather pessimistic, and theoretical justification of clipping combined with other crucial techniques, e.g. momentum acceleration, is still lacking. In this paper, we bridge the gap by presenting a general framework to study the clipping algorithms, which also takes momentum methods into consideration. We provide convergence analysis of the framework in both deterministic and stochastic settings, and demonstrate the tightness of our results by comparing them with existing lower bounds. Our results imply that the efficiency of clipping methods will not degenerate even in highly non-smooth regions of the landscape. Experiments confirm the superiority of clipping-based methods in deep learning tasks.

What problem does this paper attempt to address?

This paper addresses exploding gradients and slow convergence in non-convex optimization. Specifically, it considers minimizing a non-convex function $F(x)$ in a general form, as shown in equation (1): \[ \min_{x \in \mathbb{R}^d} F(x), \] where $F$ may be stochastic, that is, $F(x)=\mathbb{E}_{\xi \sim D}[f(x, \xi)]$. Finding the global minimum of such a problem is NP-hard in general, so the paper instead seeks an $\varepsilon$-approximate first-order stationary point, i.e. a point $x$ with $\|\nabla F(x)\| \leq \varepsilon$. The analysis relies on the $(L_0, L_1)$-smoothness assumption introduced by \citet{zhang2019gradient}, which is weaker than the traditional $L$-smoothness assumption and better describes the violent fluctuation of gradients in deep neural networks. Under this assumption, the paper builds a general framework for analyzing clipping algorithms, including those combined with momentum. It provides convergence analysis of the framework in both deterministic and stochastic settings and demonstrates the tightness of the results by comparing them with existing lower bounds.

The main contributions of the paper are:
1. A general framework for analyzing clipping techniques when optimizing $(L_0, L_1)$-smooth functions. The framework covers multiple clipping algorithms, such as gradient clipping and momentum clipping.
2. A convergence analysis of this framework, with tightness demonstrated by comparison with existing lower bounds.
3. Experiments on deep learning tasks that confirm the superiority of clipping-based methods.
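
For reference, the relaxed smoothness condition and the basic clipped update can be written out as follows. This is a sketch rather than the paper's exact statement: the first line is the second-order form commonly attributed to \citet{zhang2019gradient} for twice-differentiable $F$, and the symbols $\eta$ (step size) and $\gamma$ (clipping threshold) are notation assumed here for illustration.
\[ \|\nabla^2 F(x)\| \leq L_0 + L_1 \|\nabla F(x)\|, \qquad x_{t+1} = x_t - \min\left(\eta, \frac{\gamma}{\|\nabla F(x_t)\|}\right) \nabla F(x_t). \]
Setting $L_1 = 0$ recovers ordinary $L$-smoothness with $L = L_0$. In the update, the step length $\|x_{t+1} - x_t\|$ never exceeds $\gamma$, so a large gradient cannot throw the iterate arbitrarily far.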
In summary, the paper tackles exploding gradients and slow convergence in non-convex optimization by analyzing a general framework of clipping algorithms under the $(L_0, L_1)$-smoothness assumption, and it supports the approach with both theoretical analysis and experiments.
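
To make the effect of clipping concrete, here is a minimal Python sketch comparing vanilla and clipped gradient descent. It is an illustration under stated assumptions, not the paper's algorithm or experiments: the toy objective $F(x) = x^4$, the helper names `grad_F` and `run_gd`, and the values of `eta` and `gamma` are all chosen here for demonstration. The objective has an unbounded second derivative, so it is not $L$-smooth, yet $12x^2 \leq 12 + 3 \cdot |4x^3|$ holds for every $x$, which fits the relaxed condition with $L_0 = 12$, $L_1 = 3$. Momentum is omitted for brevity.

```python
def grad_F(x: float) -> float:
    """Gradient of the toy objective F(x) = x**4 (an assumption for this demo)."""
    return 4.0 * x ** 3


def run_gd(x0, eta, steps, gamma=None, blowup=1e6):
    """Run gradient descent from x0 and return the list of iterates.

    gamma=None gives vanilla GD with constant step size eta.
    Otherwise the step size is min(eta, gamma / |g|), so a single
    update moves the iterate by at most gamma.
    """
    xs = [x0]
    x = x0
    for _ in range(steps):
        g = grad_F(x)
        if g == 0.0:  # exact stationary point, nothing left to do
            break
        step_size = eta if gamma is None else min(eta, gamma / abs(g))
        x = x - step_size * g
        xs.append(x)
        if abs(x) > blowup:  # treat as diverged; stop before floats overflow
            break
    return xs


# Same starting point and base step size for both runs.
vanilla = run_gd(10.0, eta=0.1, steps=2000)
clipped = run_gd(10.0, eta=0.1, steps=2000, gamma=1.0)

print("vanilla GD diverged:", abs(vanilla[-1]) > 1e6)
print("clipped GD final |grad|:", abs(grad_F(clipped[-1])))
```

With these particular settings, the unclipped run overshoots on its first step because the gradient at the starting point is large, while the clipped run moves by at most `gamma` per step until the gradient is small enough for the base step size `eta` to take over.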

Improved Analysis of Clipping Algorithms for Non-convex Optimization

High Probability Analysis for Non-Convex Stochastic Optimization with Clipping

To Clip or not to Clip: the Dynamics of SGD with Gradient Clipping in High-Dimensions

Stability and Convergence of Stochastic Gradient Clipping: Beyond Lipschitz Continuity and Smoothness

Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees

Improved Convergence in High Probability of Clipped Gradient Methods with Heavy Tails

Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition

Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity

Methods for Convex $(L_0,L_1)$-Smooth Optimization: Clipping, Acceleration, and Adaptivity

Nonconvex Stochastic Optimization under Heavy-Tailed Noises: Optimal Convergence without Gradient Clipping

On the Convergence of DP-SGD with Adaptive Clipping

Gradient Clipping Improves AdaGrad when the Noise Is Heavy-Tailed

Variance-reduced Clipping for Non-convex Optimization

From Gradient Clipping to Normalization for Heavy Tailed SGD

Algorithms with Gradient Clipping for Stochastic Optimization with Heavy-Tailed Noise

Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise

Regularized Gradient Clipping Provably Trains Wide and Deep Neural Networks

Convergence and Privacy of Decentralized Nonconvex Optimization with Gradient Clipping and Communication Compression

Smoothed Gradient Clipping and Error Feedback for Decentralized Optimization under Symmetric Heavy-Tailed Noise

Clip21: Error Feedback for Gradient Clipping