Abstract: Due to the high communication cost in distributed and federated learning, methods relying on compressed communication are becoming increasingly popular. Moreover, the gradient-type methods that perform best in theory and in practice invariably rely on some form of acceleration/momentum to reduce the number of communication rounds (faster convergence), e.g., Nesterov's accelerated gradient descent (Nesterov, 1983, 2004) and Adam (Kingma and Ba, 2014). To combine the benefits of communication compression and convergence acceleration, we propose a \emph{compressed and accelerated} gradient method based on ANITA (Li, 2021) for distributed optimization, which we call CANITA. CANITA achieves the \emph{first accelerated rate} $O\bigg(\sqrt{\Big(1+\sqrt{\frac{\omega^3}{n}}\Big)\frac{L}{\epsilon}} + \omega\big(\frac{1}{\epsilon}\big)^{\frac{1}{3}}\bigg)$, which improves upon the state-of-the-art non-accelerated rate $O\left((1+\frac{\omega}{n})\frac{L}{\epsilon} + \frac{\omega^2+\omega}{\omega+n}\frac{1}{\epsilon}\right)$ of DIANA (Khaled et al., 2020) for distributed general convex problems, where $\epsilon$ is the target error, $L$ is the smoothness parameter of the objective, $n$ is the number of machines/devices, and $\omega$ is the compression parameter (larger $\omega$ means more compression can be applied, and no compression implies $\omega=0$). Our results show that as long as the number of devices $n$ is large (often true in distributed/federated learning), or the compression $\omega$ is not very high, CANITA achieves the faster convergence rate $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$, i.e., the number of communication rounds is $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$ (vs. $O\big(\frac{L}{\epsilon}\big)$ achieved by previous works). As a result, CANITA enjoys the advantages of both compression (compressed communication in each round) and acceleration (far fewer communication rounds).
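To make the comparison between the two rates concrete, the following minimal sketch evaluates the expressions inside the two $O(\cdot)$ bounds quoted in the abstract. It is an illustration only: the hidden constants are dropped, so the returned numbers indicate scaling rather than actual round counts, and the function names `canita_rounds` and `diana_rounds` are hypothetical helpers that do not come from the paper.

```python
import math


def canita_rounds(omega: float, n: int, L: float, eps: float) -> float:
    """Expression inside the accelerated O(.) bound, constants dropped (assumption)."""
    accelerated_term = math.sqrt((1 + math.sqrt(omega**3 / n)) * L / eps)
    compression_term = omega * (1 / eps) ** (1 / 3)
    return accelerated_term + compression_term


def diana_rounds(omega: float, n: int, L: float, eps: float) -> float:
    """Expression inside the non-accelerated O(.) bound, constants dropped (assumption)."""
    smooth_term = (1 + omega / n) * L / eps
    variance_term = (omega**2 + omega) / (omega + n) / eps
    return smooth_term + variance_term


if __name__ == "__main__":
    # Illustrative values only: many devices, moderate compression.
    omega, n, L, eps = 10.0, 10_000, 1.0, 1e-6
    print("accelerated bound expression:    ", canita_rounds(omega, n, L, eps))
    print("non-accelerated bound expression:", diana_rounds(omega, n, L, eps))
```

With $\omega = 0$ (no compression) the first expression reduces to $\sqrt{L/\epsilon}$ and the second to $L/\epsilon$, matching the uncompressed accelerated and non-accelerated rates stated in the abstract.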

MARINA: Faster Non-Convex Distributed Learning with Compression

On Biased Compression for Distributed Learning

Distributed learning with compressed gradient differences

CANITA: Faster Rates for Distributed Convex Optimization with Communication Compression

LoCoDL: Communication-Efficient Distributed Learning with Local Training and Compression

Correlated Quantization for Faster Nonconvex Distributed Optimization

Improving the Worst-Case Bidirectional Communication Complexity for Nonconvex Distributed Optimization under Function Similarity

TAMUNA: Doubly Accelerated Distributed Optimization with Local Training, Compression, and Partial Participation

Distributed Newton-Type Methods with Communication Compression and Bernoulli Aggregation

Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization

Communication Compression for Byzantine Robust Learning: New Efficient Algorithms and Improved Rates

Communication Compression for Distributed Learning without Control Variates

Byzantine-Robust and Communication-Efficient Distributed Learning via Compressed Momentum Filtering

Accelerated Primal-Dual Algorithms for Distributed Smooth Convex Optimization over Networks

Decentralized Deep Learning with Arbitrary Communication Compression

Distributed Algorithms for Composite Optimization: Unified Framework and Convergence Analysis

Preserved central model for faster bidirectional compression in distributed settings

Bidirectional compression in heterogeneous settings for distributed or federated learning with partial participation: tight convergence guarantees

Faster Rates for Compressed Federated Learning with Client-Variance Reduction

Communication Compression for Distributed Nonconvex Optimization

Distributed Learning Systems with First-order Methods