InGo: In-Network Aggregation Routing with Batch Size Adjustment for Distributed Training

Jianfeng Bao, Gongming Zhao, Hongli Xu, Haibo Wang, Peng Yang
DOI: https://doi.org/10.1109/iwqos61813.2024.10682850
2024-01-01
Abstract: Distributed training has emerged as a critical application in clusters due to the widespread adoption of AI technology across various domains. However, as distributed training continues to advance, it has become increasingly time-consuming. To address this challenge, researchers have explored In-Network Aggregation (INA) as a way to speed up distributed model training. Specifically, by harnessing programmable hardware such as Intel Tofino switches, INA aggregates gradients within the network, which reduces the volume of gradient traffic and accelerates distributed training. However, previous work assumes a fixed routing selection and a fixed batch size, ignoring their impact on model convergence, which leads to longer completion times. To bridge this gap, we propose InGo, a pioneering approach that jointly considers in-network aggregation routing and batch size adjustment, and we provide a rigorous convergence analysis. We then formally define the problem of in-network aggregation routing with batch size adjustment and present an efficient algorithm with bounded approximation factors to solve it. Through extensive experiments on both physical platforms and simulated environments, we demonstrate that InGo reduces completion time by 25.2% to 74.7% compared to state-of-the-art solutions.
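As a rough illustration of why in-network aggregation reduces gradient traffic, the sketch below models a single switch that sums the workers' gradient vectors before forwarding them to a parameter server. This is a toy model under stated assumptions, not InGo's routing or batch-size algorithm: the single-switch topology and the names `aggregate` and `uplink_vectors` are hypothetical and do not come from the paper.

```python
from typing import List


def aggregate(gradients: List[List[float]]) -> List[float]:
    """Element-wise sum of equal-length gradient vectors.

    Stands in for the aggregation a programmable switch would perform
    (assumption: all workers send vectors of the same length).
    """
    return [sum(values) for values in zip(*gradients)]


def uplink_vectors(num_workers: int, use_ina: bool) -> int:
    """Gradient vectors crossing the switch-to-server link per iteration.

    Without INA every worker's vector is forwarded unchanged; with INA the
    switch forwards a single aggregated vector.
    """
    return 1 if use_ina else num_workers


# Three hypothetical workers, each holding a two-element gradient.
worker_grads = [[0.5, -1.0], [1.5, 2.0], [-1.0, 1.0]]
summed = aggregate(worker_grads)

print("aggregated gradient:", summed)
print("uplink vectors without INA:", uplink_vectors(len(worker_grads), False))
print("uplink vectors with INA:", uplink_vectors(len(worker_grads), True))
```

In this toy setting the traffic on the link toward the server no longer grows with the number of workers once the switch aggregates. Which switches perform the aggregation and how the batch size is chosen are the questions the paper addresses, and they are not modeled here.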