XAgg: Accelerating Heterogeneous Distributed Training Through XDP-Based Gradient Aggregation

Qianyu Zhang, Gongming Zhao, Hongli Xu, Peng Yang
DOI: https://doi.org/10.1109/tnet.2023.3339524
2024-01-01
Abstract: With the growth of model, dataset, and system size for distributed model training in datacenters, the widely used Parameter Server (PS) architecture suffers from a communication bottleneck in gradient transmission. Recent works use programmable switches to implement in-network gradient aggregation and alleviate communication bottlenecks on PSs. Because programmable switches have limited on-chip memory, gradient transmission requires strict synchronization to achieve ideal aggregation performance. However, distributed training systems in datacenters are usually heterogeneous (e.g., in computation and bandwidth), so gradients reach the aggregation nodes asynchronously, which seriously degrades aggregation performance. To solve this issue, we propose XAgg, which accelerates heterogeneous gradient aggregation by deploying an eXpress Data Path (XDP) based aggregator on servers. Specifically, the abundant idle memory on servers can cache the entire gradient and thus effectively handle asynchronous gradient transmission in heterogeneous scenarios. Moreover, XDP provides high-performance, low-latency gradient aggregation. We conduct microbenchmark and testbed experiments with real-world DNN models and datasets. Experimental results show that XAgg improves gradient aggregation throughput by 3.3$\times$ compared with TCP-based aggregation, reaching 100 Gbps with 10 CPU cores. In addition, XAgg reduces communication time by 49%-82% compared with state-of-the-art solutions.
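
The idea of caching the entire gradient in server memory so that workers need not stay synchronized can be illustrated with a small user-space sketch. This is not the XAgg implementation and does not use XDP; it only models the buffering logic under stated assumptions. The class name `GradientAggregator`, the chunked gradient layout, and the `receive` method are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class GradientAggregator:
    """Toy aggregator that holds a slot for every chunk of the gradient.

    Assumption: the gradient is split into `num_chunks` fixed-size chunks,
    and each worker sends every chunk exactly once, in any order and at
    any time. Because all chunks have a slot in memory, a fast worker can
    run far ahead of a slow one without blocking.
    """

    num_workers: int
    num_chunks: int
    chunk_len: int
    sums: List[List[float]] = field(init=False)
    seen: List[Set[int]] = field(init=False)

    def __post_init__(self) -> None:
        # One accumulator and one contributor set per chunk.
        self.sums = [[0.0] * self.chunk_len for _ in range(self.num_chunks)]
        self.seen = [set() for _ in range(self.num_chunks)]

    def receive(self, worker_id: int, chunk_id: int,
                values: List[float]) -> Optional[List[float]]:
        """Add one worker's chunk; return the sum once all workers contributed."""
        if len(values) != self.chunk_len:
            raise ValueError("chunk has unexpected length")
        if worker_id in self.seen[chunk_id]:
            return None  # ignore duplicates (e.g. retransmissions)
        acc = self.sums[chunk_id]
        for i, v in enumerate(values):
            acc[i] += v
        self.seen[chunk_id].add(worker_id)
        if len(self.seen[chunk_id]) == self.num_workers:
            return list(acc)  # chunk fully aggregated
        return None
```

In this sketch a worker that finishes early can deliver all of its chunks before a slower worker has sent any; each chunk's result becomes available as soon as the last contribution for that chunk arrives. A memory-constrained aggregator with only a few slots would instead have to stall or drop the early chunks, which is the synchronization requirement the abstract attributes to programmable switches.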