Convergence of Distributed Adaptive Optimization with Local Updates

Ziheng Cheng, Margalit Glasgow
2024-09-20
Abstract: We study distributed adaptive algorithms with local updates (intermittent communication). Despite the great empirical success of adaptive methods in distributed training of modern machine learning models, the theoretical benefits of local updates within adaptive methods, particularly in terms of reducing communication complexity, have not been fully understood yet. In this paper, we prove that *Local* SGD with momentum (*Local* SGDM) and *Local* Adam can outperform their minibatch counterparts in convex and weakly convex settings, respectively. Our analysis relies on a novel technique to prove contraction during local iterations, which is a crucial but challenging step to show the advantages of local updates, under a generalized smoothness assumption and gradient clipping.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
The paper studies how local updates (that is, intermittent communication) affect adaptive optimization algorithms in distributed optimization, specifically Local SGD with momentum (Local SGDM) and Local Adam, and whether they reduce communication complexity. The goal is to prove that, in convex and weakly convex settings respectively, Local SGDM and Local Adam need fewer communication rounds than their minibatch counterparts (Minibatch SGDM and Minibatch Adam), which improves training efficiency.

### Main Contributions

1. **Theoretical Guarantees**:
   - The paper gives the first convergence guarantee for Local SGDM in the convex setting and proves that it converges faster than Minibatch SGDM.
   - The paper also gives a convergence rate for Local Adam in the weakly convex setting and proves that Local Adam can significantly improve communication efficiency under certain conditions.
2. **Technical Contributions**:
   - A new technique is introduced to prove the contraction property in adaptive methods, in particular to control the consensus error between different worker nodes.
   - Coordinate-wise gradient clipping is used to handle unbounded global smoothness and heavy-tailed noise, both of which are common in language-model training.

### Specific Problem Description

- **Distributed Optimization Problem**: The objective is to minimize \( f(x)=\mathbb{E}_{\xi \sim D}[F(x; \xi)] \), where \( D \) is the data distribution and \( f \) is the overall loss function. There are \( M \) parallel worker nodes, with a total of \( R \) rounds of communication and \( T \) gradient computations.
- **Local Updates**: In each communication round, every worker node independently performs \( K \) local update steps; the workers then synchronize their iterates and the associated momentum states.
- **Adaptive Methods**: The paper focuses on Local SGDM in the convex setting and Local Adam in the weakly convex setting, and in particular on their advantages in reducing communication complexity.

### Main Results

- **Local SGDM**:
  - In the strongly convex setting, Local SGDM converges faster than Minibatch SGDM.
  - In the convex setting, Local SGDM also converges faster than Minibatch SGDM, especially when \( M \) and \( K \) are large.
- **Local Adam**:
  - In the weakly convex setting, Local Adam converges faster than Minibatch Adam, especially when \( M \) is large and \( \tau \) is small.

### Technical Details

- **Assumptions**:
  - **Smoothness**: \( f \) satisfies a generalized smoothness condition on a set \( \Omega \).
  - **Noise**: The stochastic gradient noise has a bounded \( \alpha \)-th moment.
- **Proof Methods**:
  - Convergence is proved with high-probability bounds rather than the more common in-expectation bounds.
  - Auxiliary sequences are introduced to handle the accumulated stochastic gradients in adaptive methods, which is what allows the contraction property to be proved.

Together, these contributions give a theoretical basis for understanding the role of local updates in distributed adaptive optimization and offer guidance for improving communication efficiency in practice.
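
The local-update scheme summarized above (each of \( M \) workers takes \( K \) local steps, then iterates and momentum states are averaged, for \( R \) rounds) can be illustrated with a small NumPy simulation. This is only a sketch under stated assumptions, not the paper's exact algorithm: the function names (`local_sgdm`, `clip_coordinates`, `noisy_quadratic_grad`), the exponential-moving-average form of the momentum update, the clipping threshold `rho`, the toy quadratic objective, and all hyperparameter values are hypothetical choices made for illustration.

```python
import numpy as np


def clip_coordinates(g, rho):
    """Clip every coordinate of g to the interval [-rho, rho]."""
    return np.clip(g, -rho, rho)


def local_sgdm(grad_fn, x0, M, K, R, lr, beta, rho, rng):
    """Toy Local SGDM: M simulated workers, K local steps per round, R rounds.

    Assumed (illustrative) update on each worker:
        u <- beta * u + (1 - beta) * clip(g)
        x <- x - lr * u
    """
    x = np.tile(x0, (M, 1))   # per-worker iterates, shape (M, d)
    u = np.zeros_like(x)      # per-worker momentum buffers
    for _ in range(R):
        for _ in range(K):
            # Each worker draws its own stochastic gradient at its own iterate.
            g = np.stack([grad_fn(x[m], rng) for m in range(M)])
            u = beta * u + (1.0 - beta) * clip_coordinates(g, rho)
            x = x - lr * u
        # Communication round: average iterates and momentum, then broadcast.
        x = np.tile(x.mean(axis=0), (M, 1))
        u = np.tile(u.mean(axis=0), (M, 1))
    return x[0]


def noisy_quadratic_grad(x, rng):
    """Stochastic gradient of the toy objective f(x) = 0.5 * ||x||^2."""
    return x + 0.1 * rng.standard_normal(x.shape)


rng = np.random.default_rng(0)
x0 = np.full(5, 3.0)
x_final = local_sgdm(noisy_quadratic_grad, x0, M=4, K=10, R=20,
                     lr=0.1, beta=0.9, rho=1.0, rng=rng)
print("initial distance to optimum:", np.linalg.norm(x0))
print("final distance to optimum:  ", np.linalg.norm(x_final))
```

Setting `K=1` while keeping the same number of rounds mimics a minibatch-style baseline that communicates after every step, which gives a rough feel for the trade-off between local computation and communication that the paper analyzes.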