Improved Analysis of Clipping Algorithms for Non-convex Optimization

Bohang Zhang, Jikai Jin, Cong Fang, Liwei Wang
DOI: https://doi.org/10.48550/arXiv.2010.02519
2020-10-29
Abstract: Gradient clipping is commonly used in training deep neural networks partly due to its practicability in relieving the exploding gradient problem. Recently, Zhang et al. (2019) show that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD via introducing a new assumption called $(L_0, L_1)$-smoothness, which characterizes the violent fluctuation of gradients typically encountered in deep neural networks. However, their iteration complexities on the problem-dependent parameters are rather pessimistic, and theoretical justification of clipping combined with other crucial techniques, e.g. momentum acceleration, is still lacking. In this paper, we bridge the gap by presenting a general framework to study the clipping algorithms, which also takes momentum methods into consideration. We provide convergence analysis of the framework in both deterministic and stochastic settings, and demonstrate the tightness of our results by comparing them with existing lower bounds. Our results imply that the efficiency of clipping methods will not degenerate even in highly non-smooth regions of the landscape. Experiments confirm the superiority of clipping-based methods in deep learning tasks.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
This paper addresses gradient explosion and slow convergence in non-convex optimization by improving the convergence analysis of gradient clipping. Specifically, it considers minimizing a general non-convex function \(F(x)\), as in equation (1): \[ \min_{x \in \mathbb{R}^d} F(x), \] where \(F\) may be stochastic, that is, \(F(x)=\mathbb{E}_{\xi \sim D}[f(x, \xi)]\). Because finding the global minimum of such a problem is NP-hard in general, the paper instead seeks an \(\varepsilon\)-approximate first-order stationary point, that is, a point \(x\) with \(\|\nabla F(x)\| \leq \varepsilon\).
The paper works under the \((L_0, L_1)\)-smoothness assumption introduced by Zhang et al. (2019), which is weaker than the traditional \(L\)-smoothness assumption and better describes the drastic fluctuations of gradients in deep neural networks. Under this assumption, it builds a general framework for analyzing clipping algorithms that also covers momentum acceleration, provides convergence analysis in both the deterministic and stochastic settings, and shows that the resulting bounds are tight.
The main contributions of the paper are:
1. A general framework for analyzing clipping techniques when optimizing \((L_0, L_1)\)-smooth functions. The framework covers multiple clipping algorithms, such as gradient clipping and momentum clipping.
2. A convergence analysis of this framework, with tightness demonstrated by comparison against existing lower bounds.
3. Experiments confirming that clipping-based methods perform better on deep learning tasks.
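To make the role of clipping concrete, here is a minimal sketch of gradient descent with a clipped momentum step. It is an illustration under stated assumptions, not the paper's exact update rule: the function names `clipped_momentum_gd` and `grad_cosh`, the hyperparameter values, and the test objective \(F(x)=\sum_i 2\cosh(x_i)\) are all chosen here for illustration. In one dimension this objective satisfies \(|F''(x)| \leq 2 + |F'(x)|\), which matches the twice-differentiable form of \((L_0, L_1)\)-smoothness from Zhang et al. (2019), \(\|\nabla^2 F(x)\| \leq L_0 + L_1\|\nabla F(x)\|\), with \(L_0 = 2\) and \(L_1 = 1\), yet it is not \(L\)-smooth for any finite \(L\).

```python
import math


def clipped_momentum_gd(grad, x0, eta=0.1, gamma=1.0, beta=0.0, steps=200):
    """Gradient descent with a clipped momentum step (illustrative sketch).

    grad  : callable mapping a list of floats to its gradient (list of floats)
    eta   : base step size
    gamma : clipping threshold; each update moves x by at most gamma in norm
    beta  : momentum coefficient (beta=0.0 gives plain clipped GD)
    """
    x = list(x0)
    m = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        # Exponential moving average of gradients (momentum buffer).
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        norm = math.sqrt(sum(mi * mi for mi in m))
        # Clipped step size: eta when the buffer is small, gamma/norm otherwise.
        step = eta if norm == 0.0 else min(eta, gamma / norm)
        x = [xi - step * mi for xi, mi in zip(x, m)]
    return x


def grad_cosh(x):
    """Gradient of the hypothetical test objective F(x) = sum_i 2*cosh(x_i)."""
    return [2.0 * math.sinh(xi) for xi in x]


if __name__ == "__main__":
    # Start far from the minimizer at 0, where the gradient is very large.
    print(clipped_momentum_gd(grad_cosh, [10.0]))
```

With `beta=0.0` and the defaults above, the iterate moves by at most `gamma` per step while the gradient is large and then contracts toward the minimizer at 0. Passing `gamma=math.inf` disables clipping; from the same starting point the first unclipped step overshoots so far that `math.sinh` raises `OverflowError` on the next gradient evaluation, which is a small-scale picture of the exploding-gradient behavior that clipping is meant to prevent.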
In summary, the paper addresses gradient explosion and slow convergence in non-convex optimization by building a general framework for clipping algorithms under the \((L_0, L_1)\)-smoothness assumption, and it supports the framework with both convergence analysis and experiments.