Abstract: Combining gradient compression methods (e.g., CountSketch, quantization) and adaptive optimizers (e.g., Adam, AMSGrad) is a desirable goal in federated learning (FL), with potential benefits on both fewer communication rounds and less per-round communication. In spite of the preliminary empirical success of sketched adaptive methods, existing convergence analyses show the communication cost to have a linear dependence on the ambient dimension, i.e., number of parameters, which is prohibitively high for modern deep learning models. In this work, we introduce specific sketched adaptive federated learning (SAFL) algorithms and, as our main contribution, provide theoretical convergence analyses in different FL settings with guarantees on communication cost depending only logarithmically (instead of linearly) on the ambient dimension. Unlike existing analyses, we show that the entry-wise sketching noise existent in the preconditioners and the first moments of SAFL can be implicitly addressed by leveraging the recently-popularized anisotropic curvatures in deep learning losses, e.g., fast decaying loss Hessian eigenvalues. In the i.i.d. client setting of FL, we show that SAFL achieves asymptotic $O(1/\sqrt{T})$ convergence, and converges faster in the initial epochs. In the non-i.i.d. client setting, where non-adaptive methods lack convergence guarantees, we show that SACFL (SAFL with clipping) algorithms can provably converge in spite of the additional heavy-tailed noise. Our theoretical claims are supported by empirical studies on vision and language tasks, and in both fine-tuning and training-from-scratch regimes. Surprisingly, as a by-product of our analysis, the proposed SAFL methods are competitive with the state-of-the-art communication-efficient federated learning algorithms based on error feedback.

What problem does this paper attempt to address?

The problem this paper addresses is the communication cost of federated learning (FL). Specifically, the authors combine gradient compression methods (such as CountSketch and quantization) with adaptive optimizers (such as Adam and AMSGrad) to reduce both the number of communication rounds and the amount of communication per round, thereby improving the communication efficiency of federated deep learning.

### Main problems:

1. **High communication cost**: In existing federated learning methods for modern deep-learning models, the communication complexity is $O(dT)$, where $d$ is the dimension of the parameter space and $T$ is the number of rounds required for convergence. For modern deep-learning models, $d$ is very large, resulting in excessively high communication costs.
2. **Insufficient theoretical analysis**: Although combining gradient compression with adaptive optimizers has shown initial empirical success, existing theoretical analyses give a communication cost that depends linearly on the parameter dimension $d$, which is prohibitive for modern deep-learning models.

### Main contributions of the paper:

- Proposes a new class of algorithms, **Sketched Adaptive Federated Learning (SAFL)**, which combines random sketching with adaptive optimizers and comes with theoretical convergence guarantees under different federated learning settings, with a communication cost that depends only logarithmically, rather than linearly, on $d$.
- Shows that the entry-wise sketching noise in adaptive optimizers can be handled by leveraging the anisotropic curvature of deep-learning loss functions (such as rapidly decaying Hessian eigenvalues).
- Proves convergence of the SAFL algorithm under both independent and identically distributed (i.i.d.) and non-i.i.d. client data settings, and shows that it converges faster in the initial rounds.
- For non-i.i.d. data with heavy-tailed noise, proposes the SAFL algorithm with clipping (SACFL) and proves its optimal convergence rate under $\alpha$-order moment noise.

### Specific implementation:

- **Algorithm framework**: The SAFL algorithm uses an unbiased gradient estimator in each communication round and avoids additional server-side compression rounds. It projects the gradient onto a low-dimensional subspace through random sketching, thereby reducing the amount of communication per round.
- **Theoretical analysis**: Using high-probability bounds, the paper proves that in a non-convex deep-learning setting a sketch size of $b = O(\log d)$ is sufficient to achieve an asymptotic $O(1/\sqrt{T})$ convergence rate.
- **Experimental verification**: Empirical studies on vision (ResNet, Vision Transformer) and language (BERT) tasks verify the effectiveness of the SAFL algorithm and show performance comparable to that of full-dimensional, sketch-free adaptive optimizers.

Overall, the paper addresses the problem of excessively high communication costs in federated deep learning by proposing the SAFL algorithm, and it provides rigorous theoretical guarantees and empirical support.
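
To make the sketch-then-adapt idea above concrete, here is a minimal illustration in Python. It is not the paper's algorithm: it assumes a single-row CountSketch with hash functions shared by all clients, a plain average on the server, and a standard AMSGrad update applied to the desketched average. All function names (`make_hashes`, `sketch`, `desketch`, `amsgrad_step`) and the default hyperparameters are hypothetical.

```python
import numpy as np

def make_hashes(d, b, seed):
    """Bucket index and random sign for each of the d coordinates.
    Assumption: every client and the server build these from the same seed."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, b, size=d)
    signs = rng.choice(np.array([-1.0, 1.0]), size=d)
    return buckets, signs

def sketch(g, buckets, signs, b):
    """Compress a d-dimensional vector to b numbers:
    each bucket holds the signed sum of the coordinates hashed to it."""
    out = np.zeros(b)
    np.add.at(out, buckets, signs * g)  # unbuffered add handles repeated indices
    return out

def desketch(s, buckets, signs):
    """Map a b-dimensional sketch back to d dimensions.
    Coordinates that collide in a bucket contribute zero-mean noise
    when the signs are independent and uniform."""
    return signs * s[buckets]

def amsgrad_step(x, m, v, v_hat, u, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard AMSGrad update using the desketched update u."""
    m = beta1 * m + (1 - beta1) * u
    v = beta2 * v + (1 - beta2) * u ** 2
    v_hat = np.maximum(v_hat, v)  # keep the running maximum of second moments
    x = x - lr * m / (np.sqrt(v_hat) + eps)
    return x, m, v, v_hat

if __name__ == "__main__":
    d, b, n_clients = 10_000, 128, 4
    buckets, signs = make_hashes(d, b, seed=0)
    rng = np.random.default_rng(1)
    # Stand-ins for client gradients; a real run would use local training.
    client_grads = [rng.normal(size=d) for _ in range(n_clients)]
    # Each client uploads b numbers instead of d.
    uploads = [sketch(g, buckets, signs, b) for g in client_grads]
    # Sketching is linear, so averaging sketches equals sketching the average.
    avg_sketch = np.mean(uploads, axis=0)
    u = desketch(avg_sketch, buckets, signs)
    x = np.zeros(d)
    m, v, v_hat = np.zeros(d), np.zeros(d), np.zeros(d)
    x, m, v, v_hat = amsgrad_step(x, m, v, v_hat, u)
    print("uploaded floats per client:", b, "instead of", d)
```

Because the sketch is a linear map, the server can average the uploaded sketches directly and desketch once, so per-round upload per client is $b$ numbers rather than $d$. How large $b$ must be, and how the resulting noise interacts with the preconditioner, is what the paper's analysis addresses; this example does not reproduce those guarantees.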

Sketched Adaptive Federated Deep Learning: A Sharp Convergence Analysis

FedNS: A Fast Sketching Newton-Type Algorithm for Federated Learning

Adaptive Batchsize Selection and Gradient Compression for Wireless Federated Learning

Federated Adversarial Learning: A Framework with Convergence Analysis

Enhancing Convergence in Federated Learning: A Contribution-Aware Asynchronous Approach

Layer-wise and Dimension-wise Locally Adaptive Federated Learning

Understanding the Training Dynamics in Federated Deep Learning via Aggregation Weight Optimization

Communication-Efficient Federated Learning: A Variance-Reduced Stochastic Approach with Adaptive Sparsification

Towards Communication-efficient Federated Learning via Sparse and Aligned Adaptive Optimization

FLeNS: Federated Learning with Enhanced Nesterov-Newton Sketch

Communication-Efficient Zeroth-Order Adaptive Optimization for Federated Learning

Adaptive Gradient Sparsification for Efficient Federated Learning: An Online Learning Approach

FLAS: Computation and Communication Efficient Federated Learning via Adaptive Sampling

Decentralized Federated Learning: Balancing Communication and Computing Costs

On the Convergence of Communication-Efficient Local SGD for Federated Learning

Preconditioned Federated Learning

Decentralized Sporadic Federated Learning: A Unified Algorithmic Framework with Convergence Guarantees

DSFedCon: Dynamic Sparse Federated Contrastive Learning for Data-Driven Intelligent Systems

FADAS: Towards Federated Adaptive Asynchronous Optimization

Achieving Linear Speedup in Asynchronous Federated Learning with Heterogeneous Clients

On the Convergence of Heterogeneous Federated Learning with Arbitrary Adaptive Online Model Pruning