On the Convergence of Decentralized Adaptive Gradient Methods

Xiangyi Chen, Belhal Karimi, Weijie Zhao, Ping Li
DOI: https://doi.org/10.48550/arXiv.2109.03194
2021-09-08
Abstract: Adaptive gradient methods including Adam, AdaGrad, and their variants have been very successful for training deep learning models, such as neural networks. Meanwhile, given the need for distributed computing, distributed optimization algorithms are rapidly becoming a focal point. With the growth of computing power and the need for using machine learning models on mobile devices, the communication cost of distributed training algorithms needs careful consideration. In this paper, we introduce novel convergent decentralized adaptive gradient methods and rigorously incorporate adaptive gradient methods into decentralized training procedures. Specifically, we propose a general algorithmic framework that can convert existing adaptive gradient methods to their decentralized counterparts. In addition, we thoroughly analyze the convergence behavior of the proposed algorithmic framework and show that if a given adaptive gradient method converges, under some specific conditions, then its decentralized counterpart is also convergent. We illustrate the benefit of our generic decentralized framework on a prototype method, i.e., AMSGrad, both theoretically and numerically.
Machine Learning
What problem does this paper attempt to address?
The paper aims to bring adaptive gradient methods (such as Adam and AdaGrad) into the decentralized training paradigm and to guarantee that they converge there. It focuses on three points:

1. **Communication Cost**: As computing power grows, communication time has become a bottleneck in distributed training, so reducing communication cost is an important research direction for distributed optimization algorithms.
2. **Decentralized Training**: The traditional parameter-server setup is effective, but its communication cost remains high when training runs on large numbers of devices. Decentralized training removes the parameter server and lets each node communicate only with its neighbors, which reduces that cost.
3. **Convergence of Adaptive Gradient Methods**: Adaptive gradient methods perform well in centralized training, but their convergence in a decentralized environment has not been fully studied. The paper addresses this gap with a general algorithmic framework that converts existing adaptive gradient methods into decentralized versions, and it establishes conditions under which the converted methods converge.

### Specific Contributions of the Paper

1. **General Conversion Technique**: The paper proposes a general technique for converting centralized adaptive gradient methods into decentralized versions and emphasizes the importance of adaptive learning rate consensus.
2. **New Algorithm**: Based on this conversion technique, the paper develops a new decentralized optimization algorithm, Decentralized AMSGrad, and supports it with theoretical analysis and numerical experiments.
3. **Theoretical Analysis**: The paper provides a theoretical verification interface for analyzing the behavior of decentralized adaptive gradient methods obtained through the conversion technique. In particular, it establishes the first convergence rate for Decentralized AMSGrad.
4. **Divergence Problem of DADAM**: The paper shows through a specific example that existing decentralized adaptive methods (such as DADAM) may diverge in an offline setting, which further highlights the importance of adaptive learning rate consensus.

### Formula Representation

The key formulas in the paper are:

- Minimization problem of the loss function:
  \[ \min_{x \in \mathbb{R}^d} \frac{1}{N} \sum_{i=1}^N f_i(x) \]
- Lipschitz continuity assumption of the gradient:
  \[ \|\nabla f_i(x) - \nabla f_i(y)\| \leq L \|x - y\| \]
- Adaptive learning rate update rule:
  \[ \hat{v}_{t,i} = r_t(g_{1,i}, \dots, g_{t,i}) \]
  \[ u_{t,i} = \max(\tilde{u}_{t,i}, \epsilon) \]
- Key inequality in the convergence analysis:
  \[ \frac{1}{T} \sum_{t=1}^T \mathbb{E}\left[\left\|\frac{\nabla f(X_t)}{U_t^{1/4}}\right\|^2\right] \leq C_1\left(\frac{1}{T\alpha}\left(\mathbb{E}[f(Z_1)] - \min_x f(x)\right) + \frac{\alpha d \sigma^2}{N}\right) + C_2 \alpha^2 d + C_3 \alpha^3 d + \frac{1}{T\sqrt{N}}(C_4 + C_5 \alpha)\, \mathbb{E}\left[\sum_{t=1}^T \left\|-\hat{V}_{t-2} + \hat{V}_{t-1}\right\|_{\mathrm{abs}}\right] \]

Together, these contributions give decentralized adaptive gradient methods a theoretical basis and offer practical guidance for applying them.
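To make the adaptive learning rate update rule and the idea of learning rate consensus more concrete, the following NumPy sketch simulates a decentralized AMSGrad-style iteration on a toy problem. Only the two relations \(\hat{v}_{t,i} = r_t(\cdot)\) and \(u_{t,i} = \max(\tilde{u}_{t,i}, \epsilon)\) come from the summary above. Everything else is an assumption made for illustration and may differ from the paper's exact algorithm: the ring-shaped mixing matrix `W`, the gossip averaging of the iterates and of \(\tilde{u}\), the momentum term, the tracking update for \(\tilde{u}\), the quadratic local losses \(f_i(x) = \tfrac{1}{2}\|x - b_i\|^2\), and all hyperparameter values. The function names are hypothetical.

```python
import numpy as np


def ring_mixing_matrix(n):
    """Symmetric, doubly stochastic weights (assumed topology):
    each node averages itself and its two ring neighbours."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] += 1.0 / 3.0
    return W


def decentralized_amsgrad(targets, W, alpha=0.05, beta1=0.9, beta2=0.99,
                          eps=1e-2, steps=2000):
    """Minimise (1/N) * sum_i 0.5 * ||x - targets[i]||^2, one iterate per node.

    Illustrative sketch only; hyperparameters are arbitrary choices.
    """
    n, d = targets.shape
    x = np.zeros((n, d))            # row i is node i's local copy of x
    m = np.zeros((n, d))            # first-moment (momentum) estimates
    v = np.zeros((n, d))            # second-moment estimates
    v_hat = np.full((n, d), eps)    # AMSGrad running maximum, v_hat_{t,i}
    u_tilde = v_hat.copy()          # gossip-tracked state, u_tilde_{t,i}

    for _ in range(steps):
        g = x - targets                           # exact local gradients (no noise)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        v_hat_new = np.maximum(v_hat, v)          # r_t chosen as in AMSGrad

        x_half = W @ x                            # neighbours average their iterates
        u_half = W @ u_tilde                      # and their learning-rate state
        u = np.maximum(u_tilde, eps)              # u_{t,i} = max(u_tilde_{t,i}, eps)

        x = x_half - alpha * m / np.sqrt(u)       # adaptive step with the shared scaling
        u_tilde = u_half + (v_hat_new - v_hat)    # track the network average of v_hat
        v_hat = v_hat_new
    return x


# Toy run: five nodes, each holding a different target vector b_i.
targets = np.array([[1.0, -2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 2.0], [5.0, 4.0]])
x_final = decentralized_amsgrad(targets, ring_mixing_matrix(len(targets)))
print("mean of node iterates:      ", x_final.mean(axis=0))
print("minimizer of the average loss:", targets.mean(axis=0))
```

In this sketch the nodes never exchange gradients; they only average `x` and `u_tilde` with their neighbours. Averaging `u_tilde` is one way to realize the adaptive learning rate consensus that the summary emphasizes: every node ends up scaling its step by approximately the same per-coordinate factor, so the network average of the iterates behaves like a single adaptive method applied to the average loss.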