Abstract: Adaptive gradient methods including Adam, AdaGrad, and their variants have been very successful for training deep learning models, such as neural networks. Meanwhile, given the need for distributed computing, distributed optimization algorithms are rapidly becoming a focal point. With the growth of computing power and the need for using machine learning models on mobile devices, the communication cost of distributed training algorithms needs careful consideration. In this paper, we introduce novel convergent decentralized adaptive gradient methods and rigorously incorporate adaptive gradient methods into decentralized training procedures. Specifically, we propose a general algorithmic framework that can convert existing adaptive gradient methods to their decentralized counterparts. In addition, we thoroughly analyze the convergence behavior of the proposed algorithmic framework and show that if a given adaptive gradient method converges, under some specific conditions, then its decentralized counterpart is also convergent. We illustrate the benefit of our generic decentralized framework on a prototype method, i.e., AMSGrad, both theoretically and numerically.

What problem does this paper attempt to address?

The main problem this paper addresses is how to bring adaptive gradient methods (such as Adam and AdaGrad) into the decentralized training paradigm while guaranteeing that they converge in that setting. Specifically, the paper focuses on the following points:

1. **Communication Cost**: As computing power improves, communication time has gradually become a bottleneck in distributed training. Reducing communication cost is therefore an important research direction for distributed optimization algorithms.
2. **Decentralized Training**: Although the traditional parameter-server setup is effective, its communication cost remains high across large-scale decentralized devices. The decentralized training paradigm reduces this cost by removing the parameter server and allowing each node to communicate only with its neighbors.
3. **Convergence of Adaptive Gradient Methods**: Although adaptive gradient methods perform well in centralized training, their convergence in a decentralized environment has not been fully studied. The paper attempts to fill this gap by proposing a general algorithmic framework that converts existing adaptive gradient methods into decentralized versions and by proving that the converted methods converge.

### Specific Contributions of the Paper

1. **General Conversion Technique**: The paper proposes a general technique that converts centralized adaptive gradient methods into decentralized versions and emphasizes the importance of adaptive learning rate consensus.
2. **New Algorithm**: Based on the proposed conversion technique, the paper develops a new decentralized optimization algorithm, Decentralized AMSGrad, and provides theoretical analysis and numerical experiments to verify its effectiveness.
3. **Theoretical Analysis**: The paper provides a theoretical verification interface for analyzing the behavior of decentralized adaptive gradient methods obtained through the conversion technique and, in particular, proves a convergence rate for Decentralized AMSGrad for the first time.
4. **Divergence Problem of DADAM**: The paper shows through a specific example that existing decentralized adaptive methods (such as DADAM) may diverge in an offline setting, further highlighting the importance of studying adaptive learning rate consensus.

### Formula Representation

The main formulas in the paper are as follows:

- Minimization problem of the loss function:

  \[ \min_{x \in \mathbb{R}^d} \frac{1}{N} \sum_{i = 1}^N f_i(x) \]

- Lipschitz continuity assumption on the gradient:

  \[ \|\nabla f_i(x) - \nabla f_i(y)\| \leq L\|x - y\| \]

- Adaptive learning rate update rule:

  \[ \hat{v}_{t,i} = r_t(g_{1,i}, \dots, g_{t,i}) \]

  \[ u_{t,i} = \max(\tilde{u}_t, \epsilon) \]

- Key inequality in the convergence analysis:

  \[ \frac{1}{T} \sum_{t = 1}^T \mathbb{E}\left[\left\|\frac{\nabla f(X_t)}{U_t^{1/4}}\right\|^2\right] \leq C_1\left(\frac{1}{T\alpha}\left(\mathbb{E}[f(Z_1)] - \min_x f(x)\right) + \frac{\alpha d\sigma^2}{N}\right) + C_2\alpha^2 d + C_3\alpha^3 d + \frac{1}{T\sqrt{N}}(C_4 + C_5\alpha)\,\mathbb{E}\left[\sum_{t = 1}^T \left\|-\hat{V}_{t - 2} + \hat{V}_{t - 1}\right\|_{abs}\right] \]

Through these contributions, the paper provides a theoretical basis and practical guidance for research on decentralized adaptive gradient methods.
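
The idea of adaptive learning rate consensus can be illustrated with a small NumPy sketch. It is not the paper's algorithm: the summary above only states that nodes talk to their neighbors and that the adaptive learning rate has the form \(u_{t,i} = \max(\tilde{u}_t, \epsilon)\). The ring topology, the mixing matrix `W`, the quadratic local losses, the ordering of the steps, the function names, and every hyperparameter value below are assumptions made for illustration, and gradient noise is left out for simplicity.

```python
import numpy as np


def ring_mixing_matrix(n):
    """Symmetric, doubly stochastic mixing matrix for a ring of n nodes.

    Each node puts weight 1/3 on itself and 1/3 on each ring neighbor
    (a hypothetical topology chosen for illustration).
    """
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] += 1.0 / 3.0
        W[i, (i - 1) % n] += 1.0 / 3.0
        W[i, (i + 1) % n] += 1.0 / 3.0
    return W


def decentralized_amsgrad_sketch(centers, W, alpha=0.05, beta1=0.9,
                                 beta2=0.99, eps=1e-6, steps=2000):
    """AMSGrad-style updates with neighbor averaging (illustrative sketch).

    Assumed local loss at node i: f_i(x) = 0.5 * ||x - centers[i]||^2,
    so the minimizer of the average loss is centers.mean(axis=0).
    Row i of every array holds the state of node i.
    """
    n, d = centers.shape
    x = np.zeros((n, d))            # local parameter copies
    m = np.zeros((n, d))            # first-moment estimates
    v = np.zeros((n, d))            # second-moment estimates
    v_hat = np.zeros((n, d))        # running maximum of v (AMSGrad)
    u_tilde = np.full((n, d), eps)  # tracked adaptive learning rate
    u = np.full((n, d), eps)        # clipped rate actually used

    for _ in range(steps):
        g = x - centers                          # exact local gradients
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        v_hat_new = np.maximum(v_hat, v)

        # Consensus on the adaptive learning rate: average the neighbors'
        # values, then add the local change in v_hat.
        u_tilde = W @ u_tilde + (v_hat_new - v_hat)
        v_hat = v_hat_new
        u = np.maximum(u_tilde, eps)             # u = max(u_tilde, eps)

        # Consensus on the parameters followed by the adaptive step.
        x = W @ x - alpha * m / np.sqrt(u)

    return x, u


if __name__ == "__main__":
    centers = np.array([[1.0, -2.0], [2.0, 0.0], [3.0, 1.0],
                        [0.0, 3.0], [-1.0, 3.0]])
    x, u = decentralized_amsgrad_sketch(centers, ring_mixing_matrix(5))
    print("network average of x:", x.mean(axis=0))
    print("target (mean of centers):", centers.mean(axis=0))
```

In this toy setup the network average of the iterates should approach the minimizer of the average loss, and the rows of `u` should end up nearly identical across nodes, which is the consensus property the summary emphasizes. With a constant step size the individual nodes in this sketch generally settle near that point rather than agreeing exactly.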

On the Convergence of Decentralized Adaptive Gradient Methods

Toward Communication Efficient Adaptive Gradient Method

Convergence of Asynchronous Distributed Gradient Methods over Stochastic Networks

Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization

Decentralized SGD with Asynchronous, Local and Quantized Updates

On the Convergence of Decentralized Gradient Descent

Multi-consensus decentralized accelerated gradient descent

Adjacent Leader Decentralized Stochastic Gradient Descent

Decentralized Stochastic Subgradient Methods for Nonsmooth Nonconvex Optimization

Convergence of Distributed Adaptive Optimization with Local Updates

Convergence in High Probability of Distributed Stochastic Gradient Descent Algorithms

Faster Adaptive Decentralized Learning Algorithms

Quantized Adaptive Subgradient Algorithms and Their Applications

On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization

Peering Beyond the Gradient Veil with Distributed Auto Differentiation

Communication-Efficient Adaptive Batch Size Strategies for Distributed Local Gradient Methods

Adaptive Random Walk Gradient Descent for Decentralized Optimization

CADA: Communication-Adaptive Distributed Adam

Local Methods with Adaptivity via Scaling

Distributed Adaptive Subgradient Algorithms for Online Learning over Time-Varying Networks