Abstract: We study distributed adaptive algorithms with local updates (intermittent communication). Despite the great empirical success of adaptive methods in distributed training of modern machine learning models, the theoretical benefits of local updates within adaptive methods, particularly in terms of reducing communication complexity, have not been fully understood yet. In this paper, we prove that Local SGD with momentum (Local SGDM) and Local Adam can outperform their minibatch counterparts in convex and weakly convex settings, respectively. Our analysis relies on a novel technique to prove contraction during local iterations, which is a crucial but challenging step to show the advantages of local updates, under a generalized smoothness assumption and gradient clipping.

What problem does this paper attempt to address?

The paper asks how local updates (i.e., intermittent communication) affect the communication complexity of adaptive optimization algorithms in distributed optimization, specifically Local Stochastic Gradient Descent with Momentum (Local SGDM) and Local Adam. Its goal is to prove that, in convex and weakly convex settings, Local SGDM and Local Adam need fewer communication rounds than their minibatch versions (Minibatch SGDM and Minibatch Adam), thereby improving training efficiency.

### Main Contributions

1. **Theoretical Guarantees**:
   - The paper gives the first convergence guarantee for Local SGDM in the convex setting and proves that it converges faster than Minibatch SGDM.
   - The paper also gives a convergence rate for Local Adam in the weakly convex setting and proves that Local Adam can significantly improve communication efficiency under certain conditions.
2. **Technical Contributions**:
   - A new technique is introduced to prove the contraction property in adaptive methods, in particular to control the consensus error between worker nodes.
   - A coordinate-wise clipping mechanism is used to handle unbounded global smoothness and heavy-tailed noise, both of which are common in language models.

### Specific Problem Description

- **Distributed Optimization Problem**: The objective is to minimize \( f(x)=\mathbb{E}_{\xi \sim D}[F(x; \xi)] \), where \( D \) is the data distribution and \( f \) is the overall loss function. There are \( M \) parallel worker nodes, with a total of \( R \) rounds of communication and \( T \) gradient computations.
- **Local Updates**: In each communication round, every worker node independently performs \( K \) local update steps, after which the workers synchronize their iterates and the associated momentum states.
- **Adaptive Methods**: The paper focuses on the performance of Local SGDM and Local Adam in convex and weakly convex settings, especially their advantages in reducing communication complexity.

### Main Results

- **Local SGDM**:
  - In the strongly convex setting, Local SGDM converges faster than Minibatch SGDM.
  - In the convex setting, Local SGDM also converges faster than Minibatch SGDM, especially for large \( M \) and large \( K \).
- **Local Adam**:
  - In the weakly convex setting, Local Adam converges faster than Minibatch Adam, especially for large \( M \) and small \( \tau \).

### Technical Details

- **Assumptions**:
  - **Smoothness**: \( f \) satisfies a generalized smoothness condition on a certain set \( \Omega \).
  - **Noise**: The noise has a bounded \( \alpha \)-th moment.
- **Proof Methods**:
  - Convergence is proved with high-probability bounds rather than the traditional in-expectation bounds.
  - Auxiliary sequences are introduced to handle the accumulated stochastic gradients in adaptive methods, which makes it possible to prove the contraction property.

Through these contributions, the paper provides a theoretical basis for understanding the role of local updates in distributed adaptive optimization and offers guidance for improving communication efficiency in practice.
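The local-update scheme described above (each of \( M \) workers takes \( K \) local steps per round, then iterates and momentum states are averaged, over \( R \) rounds) can be illustrated with a minimal NumPy sketch. This is not the paper's algorithm as stated: the toy quadratic objective, the Gaussian noise, the step size `eta`, the momentum parameter `beta`, the clipping threshold `rho`, the choice to clip the stochastic gradient itself, and the function name `local_sgdm` are all assumptions made for illustration.

```python
import numpy as np

def local_sgdm(x0, x_star, M=4, R=30, K=10, eta=0.1, beta=0.9,
               rho=1.0, noise_std=0.1, seed=0):
    """Toy Local SGDM on f(x) = 0.5 * ||x - x_star||^2 (hypothetical setup).

    M workers, R communication rounds, K local steps per round.
    All hyperparameter names and values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x_star = np.asarray(x_star, dtype=float)
    # Row i holds worker i's iterate; all workers start from the same point.
    x = np.tile(np.asarray(x0, dtype=float), (M, 1))
    m = np.zeros_like(x)  # per-worker momentum buffers

    for _ in range(R):
        for _ in range(K):
            # Stochastic gradient of the toy objective, independent noise per worker.
            g = (x - x_star) + noise_std * rng.standard_normal(x.shape)
            # Coordinate-wise clipping (assumed here to act on the gradient).
            g = np.clip(g, -rho, rho)
            # Momentum update followed by a local step, with no communication.
            m = beta * m + (1.0 - beta) * g
            x = x - eta * m
        # Communication: average both iterates and momentum across workers.
        x[:] = x.mean(axis=0)
        m[:] = m.mean(axis=0)

    return x[0]  # after synchronization every worker holds the same iterate

if __name__ == "__main__":
    x_start = np.full(5, 5.0)
    x_opt = np.zeros(5)
    x_final = local_sgdm(x_start, x_opt)
    print("initial distance:", np.linalg.norm(x_start - x_opt))
    print("final distance:  ", np.linalg.norm(x_final - x_opt))
```

In this sketch the workers communicate only `R` times while taking `R * K` gradient steps each. A minibatch baseline under the same communication budget would instead take `R` steps, each using the average of `K` gradients evaluated at the same point.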

Convergence of Distributed Adaptive Optimization with Local Updates
