Optimizing Communication Topology for Collaborative Learning Across Datacenters

Long Luo, Shulin Yang, Wenjiao Feng, Hongfang Yu, Gang Sun, Bo Lei
DOI: https://doi.org/10.1007/978-981-19-9697-9_15
2023-01-01
Abstract: Federated learning (FL) is emerging as an increasingly important paradigm for collaboratively training high-quality machine learning (ML) models over massive amounts of data stored in geo-distributed datacenters. However, the communication efficiency of gradient aggregation during training is a primary bottleneck that impedes the adoption of FL, especially in cross-silo settings, because the available bandwidth of the inter-datacenter links connecting data silos is often very limited. To improve the training efficiency of cross-silo FL between datacenters, we propose TOPOADOPT, an efficient communication topology design for gradient aggregation. TOPOADOPT uses multiple aggregators to share the aggregation load and tree-based hierarchical aggregation to reduce bandwidth consumption from clients to aggregators. For better performance, it jointly optimizes the assignment of parameters among aggregators and the construction of the aggregation trees. We formulate this optimization problem as a mixed-integer nonlinear programming model and develop efficient algorithms that find satisfactory communication topologies in reasonable computational time. Experimental results show that TOPOADOPT achieves a speedup of up to 5.2x in gradient aggregation completion time compared to existing solutions.
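The two ideas named in the abstract, multiple aggregators that each own part of the parameters and tree-based aggregation toward each aggregator, can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names, the four datacenters, the parameter split, and the tree shapes are all hypothetical and chosen by hand, whereas the paper selects them by solving an optimization problem.

```python
from collections import defaultdict


def tree_aggregate(parent, grads, root):
    """Sum gradient shards up a tree rooted at `root`.

    parent: node -> parent node (None for the root); hypothetical topology.
    grads:  node -> list of gradient values held by that node.
    Returns (summed gradient, number of messages each node receives).
    """
    children = defaultdict(list)
    for node, par in parent.items():
        if par is not None:
            children[par].append(node)

    inbound = defaultdict(int)

    def partial_sum(node):
        # Each node adds its children's partial sums to its own gradient
        # and forwards a single combined message to its parent.
        total = list(grads[node])
        for child in children[node]:
            child_sum = partial_sum(child)
            inbound[node] += 1
            total = [a + b for a, b in zip(total, child_sum)]
        return total

    return partial_sum(root), dict(inbound)


def aggregate_with_assignment(grads, assignment):
    """Run one tree aggregation per aggregator over its parameter slice.

    assignment: aggregator -> (slice of parameter indices, parent map).
    Returns aggregator -> (aggregated shard, messages received by aggregator).
    """
    result = {}
    for agg, (param_slice, parent) in assignment.items():
        shard = {node: g[param_slice] for node, g in grads.items()}
        total, inbound = tree_aggregate(parent, shard, agg)
        result[agg] = (total, inbound.get(agg, 0))
    return result


# Four hypothetical datacenters, each holding a 4-parameter gradient.
grads = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [1.0, 1.0, 1.0, 1.0],
    "C": [0.0, 1.0, 0.0, 1.0],
    "D": [2.0, 0.0, 2.0, 0.0],
}

# Hand-picked assignment: A aggregates parameters 0-1, D aggregates 2-3,
# each over its own tree (an assumption, not an optimized topology).
assignment = {
    "A": (slice(0, 2), {"A": None, "B": "A", "C": "B", "D": "B"}),
    "D": (slice(2, 4), {"D": None, "C": "D", "A": "C", "B": "C"}),
}

result = aggregate_with_assignment(grads, assignment)
```

In this toy setup each aggregator receives one combined message instead of three separate ones, which is the bandwidth saving that hierarchical aggregation aims for; how much is saved in practice depends on link bandwidths and the chosen trees.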