Adam-family Methods with Decoupled Weight Decay in Deep Learning

Kuangyu Ding, Nachuan Xiao, Kim-Chuan Toh
2023-10-13
Abstract: In this paper, we investigate the convergence properties of a wide class of Adam-family methods for minimizing quadratically regularized nonsmooth nonconvex optimization problems, especially in the context of training nonsmooth neural networks with weight decay. Motivated by the AdamW method, we propose a novel framework for Adam-family methods with decoupled weight decay. Within our framework, the estimators for the first-order and second-order moments of stochastic subgradients are updated independently of the weight decay term. Under mild assumptions and with non-diminishing stepsizes for updating the primary optimization variables, we establish the convergence properties of our proposed framework. In addition, we show that our proposed framework encompasses a wide variety of well-known Adam-family methods, hence offering convergence guarantees for these methods in the training of nonsmooth neural networks. More importantly, we show that our proposed framework asymptotically approximates the SGD method, thereby providing an explanation for the empirical observation that decoupled weight decay enhances generalization performance for Adam-family methods. As a practical application of our proposed framework, we propose a novel Adam-family method named Adam with Decoupled Weight Decay (AdamD), and establish its convergence properties under mild conditions. Numerical experiments demonstrate that AdamD outperforms Adam and is comparable to AdamW, in the aspects of both generalization performance and efficiency.
Optimization and Control, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is the lack of convergence guarantees, and the weaker generalization performance, of Adam-family optimization methods when training nonsmooth neural networks in deep learning. Specifically:

1. **Limitations of Existing Methods**:
   - Most existing convergence analyses of Adam-family methods (such as Adam and AMSGrad) assume that the objective function \( f \) is continuously differentiable. In practice, many neural networks use nonsmooth activation functions (such as ReLU), so the loss function is usually nonsmooth and lacks Clarke regularity.
   - As a result, these analyses provide no convergence guarantees for training nonsmooth neural networks.
   - In the Adam method, the weight decay term is coupled with the gradient-based update, which can lead to poor generalization performance.
2. **Research Motivation**:
   - Inspired by the AdamW method, the authors propose a new framework (AFMDW) with decoupled weight decay. In this framework, the first- and second-order moment estimators are updated independently of the weight decay term, with the aim of improving generalization performance.
   - Although AdamW performs well in practice, its convergence in the nonsmooth case has not been fully studied.
3. **Main Problems**:
   - Can an Adam-family method with decoupled weight decay be designed so that it retains convergence guarantees under non-diminishing stepsizes, especially when training nonsmooth neural networks?
   - Can decoupled weight decay explain the improvement in generalization performance that is observed empirically?
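The difference between coupled and decoupled weight decay described above can be written out explicitly. The following is a sketch in standard Adam notation rather than the paper's own: \( g_k \) denotes a stochastic (sub)gradient of \( f \) at \( x_k \), \( \beta_1, \beta_2, \varepsilon \) are the usual Adam hyperparameters (assumed here, not taken from the summary), \( \sigma \) is the weight decay coefficient, and bias correction is omitted for brevity.

Coupled weight decay (Adam applied to the regularized objective): the decay term is added to the gradient, so it enters both moment estimators.

\[
\begin{aligned}
\tilde g_k &= g_k + \sigma x_k,\\
m_{k+1} &= \beta_1 m_k + (1-\beta_1)\,\tilde g_k,\\
v_{k+1} &= \beta_2 v_k + (1-\beta_2)\,\tilde g_k^{\,2},\\
x_{k+1} &= x_k - \eta_k\,\frac{m_{k+1}}{\sqrt{v_{k+1}}+\varepsilon}.
\end{aligned}
\]

Decoupled weight decay (AdamW-style): the moment estimators see only \( g_k \), and the decay term is applied directly to the iterate.

\[
\begin{aligned}
m_{k+1} &= \beta_1 m_k + (1-\beta_1)\,g_k,\\
v_{k+1} &= \beta_2 v_k + (1-\beta_2)\,g_k^{\,2},\\
x_{k+1} &= x_k - \eta_k\left(\frac{m_{k+1}}{\sqrt{v_{k+1}}+\varepsilon} + \sigma x_k\right).
\end{aligned}
\]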
### Specific Content

- **Form of Optimization Problem**: The paper considers the following unconstrained stochastic optimization problem:
  \[ \min_{x\in\mathbb{R}^n} g(x) := f(x)+\frac{\sigma}{2}\|x\|^2, \]
  where \( f:\mathbb{R}^n\rightarrow\mathbb{R} \) is a locally Lipschitz continuous and possibly nonsmooth function, and \(\sigma > 0\) is the penalty parameter of the quadratic regularization term.
- **Existing Challenges**:
  - Nonsmooth activation functions (such as ReLU) make the loss function nonsmooth, which makes it difficult for existing gradient-based methods to find critical points.
  - Automatic differentiation (AD) may return outputs that do not lie in the Clarke subdifferential when applied to nonsmooth neural networks.
- **Proposed Solution**:
  - A new framework for Adam-family methods (AFMDW) is proposed, in which the first- and second-order moment estimators are updated independently of the weight decay term:
    \[ \begin{cases} g_k = d_k+\xi_{k+1},\\ m_{k+1}=(1-\theta_k)m_k+\theta_k g_k,\\ v_{k+1}=(1-\rho_k)v_k+\rho_k(g_k)^2,\\ x_{k+1}=x_k-\eta_k H(v_{k+1})\odot(m_{k+1}+\sigma x_k), \end{cases} \]
    where \(\odot\) denotes the elementwise product. The weight decay term \(\sigma x_k\) appears only in the update of \(x_{k+1}\), not in \(m_{k+1}\) or \(v_{k+1}\).
- **Theoretical Contributions**:
  - Convergence of the framework (AFMDW) is proved under non-diminishing stepsizes.
  - The framework (AFMDW) covers a wide range of Adam-family methods and provides convergence guarantees for these methods in training nonsmooth neural networks.
  - The framework (AFMDW) is shown to asymptotically approximate the SGD method, which offers an explanation for why decoupled weight decay improves generalization performance.
- **Experimental Verification**:
  - A new Adam-family method, AdamD, is proposed, and its performance is evaluated on image classification and language modeling tasks.
  - The experimental results show that AdamD outperforms Adam and is comparable to AdamW on image classification tasks, and that it performs better than AdamW on language modeling tasks.
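The AFMDW update rule quoted above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `afmdw_step` and `noisy_l1_subgradient` are hypothetical, the scaling \( H(v) = 1/(\sqrt{v}+\varepsilon) \) is an assumed Adam-style choice (the summary does not specify \( H \)), the parameters \( \eta_k, \theta_k, \rho_k \) are held constant, and the toy objective \( f(x)=\|x\|_1 \) with Gaussian noise stands in for \( d_k + \xi_{k+1} \).

```python
import numpy as np


def afmdw_step(x, m, v, g, eta, theta, rho, sigma, eps=1e-8):
    """One step of the decoupled-weight-decay update (illustrative sketch).

    g is a noisy (sub)gradient estimate of f alone (g_k = d_k + xi_{k+1});
    the weight decay term sigma * x is NOT included in g.
    """
    m_new = (1.0 - theta) * m + theta * g        # first-moment estimator
    v_new = (1.0 - rho) * v + rho * g ** 2       # second-moment estimator
    h = 1.0 / (np.sqrt(v_new) + eps)             # ASSUMED choice of H(v)
    x_new = x - eta * h * (m_new + sigma * x)    # weight decay enters only here
    return x_new, m_new, v_new


def noisy_l1_subgradient(x, rng, noise_std=0.1):
    """Toy stand-in for d_k + xi_{k+1}: a subgradient of ||x||_1 plus noise."""
    return np.sign(x) + noise_std * rng.standard_normal(x.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.array([1.0, -2.0, 0.5])
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for _ in range(200):
        g = noisy_l1_subgradient(x, rng)
        x, m, v = afmdw_step(x, m, v, g, eta=1e-2, theta=0.1, rho=0.01, sigma=0.1)
    print("final iterate:", x)
```

Because `sigma` is used only in the last line of `afmdw_step`, the moment estimators receive the same inputs regardless of how strong the weight decay is, which is the decoupling the framework relies on.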
In summary, the paper addresses these problems by introducing AFMDW, a framework for Adam-family methods with decoupled weight decay, and by establishing its convergence for training nonsmooth neural networks.