Sketched Adaptive Federated Deep Learning: A Sharp Convergence Analysis

Zhijie Chen, Qiaobo Li, Arindam Banerjee
2024-11-11
Abstract: Combining gradient compression methods (e.g., CountSketch, quantization) and adaptive optimizers (e.g., Adam, AMSGrad) is a desirable goal in federated learning (FL), with potential benefits on both fewer communication rounds and less per-round communication. In spite of the preliminary empirical success of sketched adaptive methods, existing convergence analyses show the communication cost to have a linear dependence on the ambient dimension, i.e., number of parameters, which is prohibitively high for modern deep learning models. In this work, we introduce specific sketched adaptive federated learning (SAFL) algorithms and, as our main contribution, provide theoretical convergence analyses in different FL settings with guarantees on communication cost depending only logarithmically (instead of linearly) on the ambient dimension. Unlike existing analyses, we show that the entry-wise sketching noise existent in the preconditioners and the first moments of SAFL can be implicitly addressed by leveraging the recently-popularized anisotropic curvatures in deep learning losses, e.g., fast decaying loss Hessian eigenvalues. In the i.i.d. client setting of FL, we show that SAFL achieves asymptotic $O(1/\sqrt{T})$ convergence, and converges faster in the initial epochs. In the non-i.i.d. client setting, where non-adaptive methods lack convergence guarantees, we show that SACFL (SAFL with clipping) algorithms can provably converge in spite of the additional heavy-tailed noise. Our theoretical claims are supported by empirical studies on vision and language tasks, and in both fine-tuning and training-from-scratch regimes. Surprisingly, as a by-product of our analysis, the proposed SAFL methods are competitive with the state-of-the-art communication-efficient federated learning algorithms based on error feedback.
Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is the communication cost of Federated Learning (FL). Specifically, the authors combine gradient compression methods (such as CountSketch and quantization) with adaptive optimizers (such as Adam and AMSGrad) to reduce both the number of communication rounds and the amount of communication per round, thereby improving the communication efficiency of federated deep learning.

### Main problems:
1. **High communication cost**: In existing federated learning methods for modern deep learning models, the communication complexity is \(O(dT)\), where \(d\) is the dimension of the parameter space and \(T\) is the number of rounds required for convergence. For modern deep learning models, \(d\) is very large, so the communication cost becomes excessive.
2. **Insufficient theoretical analysis**: Although combining gradient compression with adaptive optimizers has shown initial empirical success, existing theoretical analyses give a communication cost that depends linearly on the parameter dimension \(d\), which is impractical for modern deep learning models.

### Main contributions of the paper:
- The paper proposes a new class of algorithms, **Sketched Adaptive Federated Learning (SAFL)**, which combines randomized sketching with adaptive optimizers. SAFL comes with theoretical convergence guarantees under different federated learning settings, and its communication cost depends only logarithmically, rather than linearly, on \(d\).
- By leveraging the anisotropic curvature of deep learning loss functions (such as rapidly decaying Hessian eigenvalues), the analysis handles the entry-wise sketching noise that enters the adaptive optimizer's preconditioners and first moments.
- In the independent and identically distributed (i.i.d.) client setting, SAFL is proved to converge, with faster convergence in the initial rounds.
- For non-i.i.d. client data with heavy-tailed noise, SAFL with clipping (SACFL) is proposed, and its optimal convergence rate under noise with bounded \(\alpha\)-th moment is proved.

### Specific implementation:
- **Algorithm framework**: SAFL uses an unbiased gradient estimator in each communication round and avoids additional server-side compression rounds. It projects the gradient onto a low-dimensional subspace through randomized sketching, which reduces the amount of communication per round.
- **Theoretical analysis**: Using high-probability bounds, the paper proves that in a non-convex deep learning setting a sketch size of \(b = O(\log d)\) is sufficient to achieve an asymptotic \(O(1/\sqrt{T})\) convergence rate.
- **Experimental verification**: Empirical studies on vision (ResNet, Vision Transformer) and language (BERT) tasks evaluate SAFL and show performance comparable to that of full-dimensional, sketch-free adaptive optimizers.

Overall, the paper addresses the excessive communication cost of federated deep learning by proposing the SAFL algorithms, and it supports them with rigorous theoretical guarantees and empirical evidence.
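The sketching step described in this summary can be illustrated with a plain CountSketch, one of the compression methods the abstract names. The following is a minimal sketch under stated assumptions, not the paper's SAFL implementation: the function names (`make_count_sketch`, `sketch`, `desketch`, `federated_round`), the use of a per-round seed shared by clients and server, and the simple averaging on the server are all hypothetical choices made for illustration, and the adaptive optimizer update and clipping are omitted. It shows the two properties the summary relies on: each client sends only \(b\) numbers instead of \(d\), and the desketched vector is an unbiased estimate of the original over the random draw of the hash and signs.

```python
import random


def make_count_sketch(d, b, seed):
    """Draw a random bucket in [0, b) and a random sign for each of d coordinates.

    Assumption: clients and server derive the same draw from a shared seed.
    """
    rng = random.Random(seed)
    buckets = [rng.randrange(b) for _ in range(d)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
    return buckets, signs


def sketch(vec, buckets, signs, b):
    """Compress a length-d vector into b numbers; the map is linear in vec."""
    out = [0.0] * b
    for value, bucket, sign in zip(vec, buckets, signs):
        out[bucket] += sign * value
    return out


def desketch(sk, buckets, signs):
    """Map b numbers back to length d.

    Coordinate i reads its own bucket with its own sign. Colliding coordinates
    contribute noise with zero mean over the random signs, so the estimate is
    unbiased over the draw of (buckets, signs).
    """
    return [sign * sk[bucket] for bucket, sign in zip(buckets, signs)]


def federated_round(client_grads, b, seed):
    """One hypothetical round: clients upload sketches, server averages them.

    Because sketching is linear, averaging the sketches equals sketching the
    averaged gradient, so the server never needs a d-dimensional upload.
    """
    d = len(client_grads[0])
    buckets, signs = make_count_sketch(d, b, seed)
    sketches = [sketch(g, buckets, signs, b) for g in client_grads]  # b floats per client
    mean_sketch = [sum(col) / len(sketches) for col in zip(*sketches)]
    return desketch(mean_sketch, buckets, signs)
```

In this toy setting the per-round uplink is \(b\) floats per client rather than \(d\). A single desketched vector is noisy whenever coordinates collide; the unbiasedness holds only on average over the random draw.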