Efficient Inter-Datacenter AllReduce With Multiple Trees

Shouxi Luo, Renyi Wang, Huanlai Xing
DOI: https://doi.org/10.1109/tnse.2024.3419030
IF: 6.6
2024-08-18
IEEE Transactions on Network Science and Engineering
Abstract: In this paper, we look into the problem of achieving efficient inter-datacenter AllReduce operations for geo-distributed machine learning (Geo-DML). Compared with intra-datacenter distributed training, the heterogeneous wide-area network (WAN) connections among Geo-DML workers are scarce, expensive, and unstable, so existing proposals designed for homogeneous networks fall short. Although some recent optimizations have been proposed for Geo-DML, they break the consistency semantics of bulk synchronous parallel (BSP) and thus bring no benefit to the many existing BSP-based applications. To address these issues, we propose mTree, a topology management suite for Geo-DML. With a global view of the heterogeneous WAN connections, mTree builds multiple optimized spanning trees along with suggested workload distribution proportions, respecting the constraints on both the number of trees and their maximum height specified by the training job. Based on these results, geo-distributed workers can launch concurrent tree-based pipelined AllReduce operations to make efficient use of the heterogeneous network. Detailed performance studies on real-world network topologies show that mTree achieves efficient AllReduce, significantly outperforming existing solutions.
engineering, multidisciplinary; mathematics, interdisciplinary applications
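
The abstract mentions running several trees concurrently, each carrying a suggested share of the workload. The sketch below is only a rough illustration of why such a split can help. It is not mTree's algorithm: the abstract does not say how the proportions are computed. The per-tree rates, the proportional split rule, and the size/rate time model (which ignores latency, pipelining depth, and links shared between trees) are all assumptions made for this example, and the function names are hypothetical.

```python
from typing import List


def split_proportions(tree_rates: List[float]) -> List[float]:
    """Share of the tensor per tree, proportional to its rate (an assumed rule)."""
    if not tree_rates or any(r <= 0 for r in tree_rates):
        raise ValueError("every tree needs a positive rate")
    total = sum(tree_rates)
    return [r / total for r in tree_rates]


def split_sizes(num_elements: int, proportions: List[float]) -> List[int]:
    """Turn proportions into integer chunk sizes that sum to num_elements."""
    sizes = [int(num_elements * p) for p in proportions]
    # Give the rounding remainder to the last tree so no element is dropped.
    sizes[-1] += num_elements - sum(sizes)
    return sizes


def completion_time(sizes: List[int], tree_rates: List[float]) -> float:
    """Concurrent trees finish when the slowest tree finishes (simplified model)."""
    return max(s / r for s, r in zip(sizes, tree_rates))


# Hypothetical example: two trees whose effective rates differ by 3x.
rates = [3.0, 1.0]
proportions = split_proportions(rates)
sizes = split_sizes(1000, proportions)
multi_tree_time = completion_time(sizes, rates)
single_tree_time = completion_time([1000], [rates[0]])
```

Under this simplified model, both trees finish at the same moment, and the two-tree run is faster than sending everything over the faster tree alone.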