FedRDMA: Communication-Efficient Cross-Silo Federated LLM via Chunked RDMA Transmission

Zeling Zhang, Dongqi Cai, Yiran Zhang, Mengwei Xu, Shangguang Wang, Ao Zhou
2024-03-01
Abstract: Communication overhead is a significant bottleneck in federated learning (FL), which has been exacerbated by the increasing size of AI models. In this paper, we propose FedRDMA, a communication-efficient cross-silo FL system that integrates RDMA into the FL communication protocol. To overcome the limitations of RDMA in wide-area networks (WANs), FedRDMA divides the updated model into chunks and designs a series of optimization techniques to improve the efficiency and robustness of RDMA-based communication. We implement FedRDMA atop the industrial federated learning framework and evaluate it on a real-world cross-silo FL scenario. The experimental results show that FedRDMA can achieve up to 3.8× speedup in communication efficiency compared to traditional TCP/IP-based FL systems.
Machine Learning; Distributed, Parallel, and Cluster Computing; Networking and Internet Architecture
What problem does this paper attempt to address?
The main problem this paper addresses is communication overhead in federated learning (FL), particularly when training large language models (LLMs) across institutions (cross-silo). As AI models grow, communication becomes a significant bottleneck in FL, especially over wide-area networks (WANs), and the problem remains prominent even when bandwidth is high. For example, when full-tuning GPT-2 with two NVIDIA A800 80G GPUs over a 10 Gbps link, transmitting the model weights still takes 45.9 seconds per round, which accounts for more than 44.97% of the total FL time.
To overcome this challenge, the authors propose FedRDMA, an efficient cross-silo FL system based on Remote Direct Memory Access (RDMA). FedRDMA divides the updated model into small chunks and applies a series of optimization techniques to improve the efficiency and robustness of RDMA communication, which makes model parameter exchange over WANs more efficient. Experimental results show that, compared with traditional TCP/IP-based FL systems, FedRDMA improves communication efficiency by up to 3.8 times.
Specifically, the main contributions of the paper are as follows:
- Preliminary experiments show that cross-silo FL still suffers from high communication overhead even with high bandwidth and ample computing resources.
- The paper proposes FedRDMA, an efficient cross-silo FL system that uses chunked RDMA transmission combined with a series of optimization techniques.
- The authors implement FedRDMA on the industrial-grade FL framework FATE and evaluate it extensively, showing a communication speedup of up to 3.8 times.
In addition, the paper examines how different hyperparameters affect FedRDMA's performance and how FedRDMA can be combined with Parameter-Efficient Fine-Tuning (PEFT) methods to further improve communication efficiency. Overall, FedRDMA uses RDMA to relieve the communication bottleneck in WAN environments and thereby accelerate cross-silo federated learning of large language models.
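The chunking idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: it only shows how a serialized model update might be cut into fixed-size, indexed chunks and put back together on the receiving side. The function names (`split_into_chunks`, `reassemble`, `send_update`), the chunk size, and the `send_chunk` callback that stands in for an RDMA transfer are all assumptions made for illustration; no real RDMA or FATE API is used.

```python
from typing import Callable, Iterator, List, Tuple


def split_into_chunks(payload: bytes, chunk_size: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (index, data) pairs that cover the payload in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for index, offset in enumerate(range(0, len(payload), chunk_size)):
        yield index, payload[offset:offset + chunk_size]


def reassemble(chunks: List[Tuple[int, bytes]]) -> bytes:
    """Rebuild the payload from chunks that may arrive in any order."""
    ordered = sorted(chunks, key=lambda chunk: chunk[0])
    return b"".join(data for _, data in ordered)


def send_update(
    payload: bytes,
    chunk_size: int,
    send_chunk: Callable[[int, bytes], None],
) -> int:
    """Send the payload chunk by chunk and return the number of chunks sent.

    send_chunk is a hypothetical transport callback; in a real system it
    would hand one chunk to the RDMA layer.
    """
    count = 0
    for index, data in split_into_chunks(payload, chunk_size):
        send_chunk(index, data)
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in for serialized model weights (assumed, not real model data).
    update = bytes(range(256)) * 10
    received: List[Tuple[int, bytes]] = []
    sent = send_update(update, 1000, lambda i, d: received.append((i, d)))
    assert reassemble(received) == update
    print(f"sent {sent} chunks")
```

Keeping an explicit index with each chunk lets the receiver rebuild the update even if chunks arrive out of order, and a failed chunk could be resent on its own rather than restarting the whole transfer.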