AggTree: A Routing Tree with In-Network Aggregation for Distributed Training

Jianglong Nie, Wenfei Wu
DOI: https://doi.org/10.1109/ipccc59175.2023.10253874
2023-01-01
Abstract: In distributed training (DT) based on the parameter server (PS) architecture, workers send gradients over the network to the PS for aggregation, and synchronizing parameters this way incurs heavy communication overhead. With the development of programmable switches, in-network aggregation (INA) has been proposed to accelerate distributed training by aggregating gradients on programmable switches in the network, not only at the PS. However, existing routing methods cannot fully utilize the capability of INA, resulting in load imbalance and long communication time. This paper analyzes and models the routing problem in INA under network resource constraints and proposes a routing algorithm named AggTree, which solves the problem by searching for high-rate routing paths. Simulation results show that AggTree reduces communication time by 4.1%-37.9% for a single DT job and by 12.7%-74.0% for multiple DT jobs compared with state-of-the-art solutions.
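To make the idea of in-network aggregation concrete, the following is a minimal Python sketch. It is not the AggTree algorithm and does not model the paper's resource constraints. It assumes a fixed, hypothetical routing tree (`children`), toy gradient vectors (`grads`), and a hypothetical helper `aggregate` that sums gradients toward the PS while counting how many gradient vectors cross each link, with and without aggregation at the switches.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Link = Tuple[str, str]  # (child, parent)


def vec_add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two gradient vectors."""
    return [x + y for x, y in zip(a, b)]


def count_workers(children: Dict[str, List[str]], node: str) -> int:
    """Number of workers (leaves) in the subtree rooted at node."""
    kids = children.get(node, [])
    if not kids:
        return 1
    return sum(count_workers(children, k) for k in kids)


def aggregate(children: Dict[str, List[str]], grads: Dict[str, Vector],
              node: str, load: Dict[Link, int], ina: bool) -> Vector:
    """Return the gradient sum of the subtree at node and record link loads.

    With ina=True every switch forwards one aggregated vector upstream.
    With ina=False switches only forward, so a link carries one vector
    per worker below it and the PS does all of the summation.
    """
    kids = children.get(node, [])
    if not kids:  # a worker: contributes its own gradient
        return grads[node]
    total: Optional[Vector] = None
    for kid in kids:
        part = aggregate(children, grads, kid, load, ina)
        load[(kid, node)] = 1 if ina else count_workers(children, kid)
        total = part if total is None else vec_add(total, part)
    assert total is not None
    return total


# Hypothetical topology: PS behind a core switch, two edge switches, four workers.
children = {"ps": ["core"], "core": ["s1", "s2"],
            "s1": ["w1", "w2"], "s2": ["w3", "w4"]}
grads = {"w1": [1.0, 2.0], "w2": [3.0, 4.0],
         "w3": [5.0, 6.0], "w4": [7.0, 8.0]}

load_ina: Dict[Link, int] = {}
load_plain: Dict[Link, int] = {}
sum_ina = aggregate(children, grads, "ps", load_ina, ina=True)
sum_plain = aggregate(children, grads, "ps", load_plain, ina=False)
```

In this toy setup both runs produce the same gradient sum, but the link into the PS carries one vector with aggregation at the switches and four without it. Which tree to aggregate along, under limited switch and link resources, is the routing question the paper addresses.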