ALEPH: Accelerating Distributed Training with eBPF-Based Hierarchical Gradient Aggregation

Peng Yang, Hongli Xu, Gongming Zhao, Qianyu Zhang, Jiawei Liu, Chunming Qiao
DOI: https://doi.org/10.1109/tnet.2024.3404999
2024-01-01
Abstract: Distributed training involves two key operations, gradient transmission and gradient aggregation, which consume massive bandwidth and computing resources. To achieve efficient distributed training, one must overcome two critical challenges: heterogeneous bandwidth and limited computing resources across compute nodes. Existing architectures based on Parameter Server (PS) and All-Reduce (AR) fail to cope with these challenges. The PS aggregates gradients from all workers and therefore suffers from bandwidth bottlenecks; AR alleviates the bandwidth bottleneck at the PS, but each worker must process many gradient packets and can thus become overloaded. To address these shortcomings, we design a new distributed training system called ALEPH. In the control plane, ALEPH uses an efficient algorithm to group workers into clusters of different sizes so as to fully utilize heterogeneous bandwidth. We show that the proposed algorithm achieves good approximation performance. In the data plane, ALEPH leverages, for the first time, extended Berkeley Packet Filter (eBPF) programs to aggregate and forward gradient packets, reducing computation overhead. We show how to overcome several hurdles in using eBPF for distributed training. We implement ALEPH and evaluate its performance on a small-scale testbed and in large-scale simulations. Experimental results show that ALEPH reduces training time by 20%-31% and increases bandwidth utilization by 88% compared with state-of-the-art frameworks.
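As a rough illustration of the hierarchical aggregation idea described in the abstract, the sketch below sums gradients inside each cluster first and then combines the per-cluster partial sums. It is not ALEPH's clustering algorithm or its eBPF data plane, neither of which the abstract details. The names `sum_gradients`, `hierarchical_aggregate`, the worker IDs, and the cluster layout are all hypothetical, and plain Python lists stand in for gradient tensors.

```python
from typing import Dict, List

Gradient = List[float]


def sum_gradients(grads: List[Gradient]) -> Gradient:
    """Element-wise sum of equally sized gradient vectors."""
    return [sum(values) for values in zip(*grads)]


def hierarchical_aggregate(
    worker_grads: Dict[str, Gradient],
    clusters: List[List[str]],
) -> Gradient:
    """Average gradients in two stages (hypothetical helper, not ALEPH's code).

    Stage 1: each cluster reduces its members' gradients to one partial sum.
    Stage 2: the partial sums are combined and divided by the worker count.
    """
    partial_sums = [
        sum_gradients([worker_grads[w] for w in cluster]) for cluster in clusters
    ]
    total = sum_gradients(partial_sums)
    num_workers = sum(len(cluster) for cluster in clusters)
    return [value / num_workers for value in total]


# Assumed toy setup: four workers, clusters of unequal size
# (for example, sized to match uneven link bandwidth).
worker_grads = {
    "w0": [1.0, 2.0],
    "w1": [3.0, 4.0],
    "w2": [5.0, 6.0],
    "w3": [7.0, 8.0],
}
clusters = [["w0", "w1", "w2"], ["w3"]]

averaged = hierarchical_aggregate(worker_grads, clusters)
flat = hierarchical_aggregate(worker_grads, [["w0", "w1", "w2", "w3"]])
print(averaged, flat)
```

Because summation is associative, the two-stage result matches a flat average over all workers on this input; the grouping only changes where the traffic and the reduction work land, which is the degree of freedom the abstract's control plane exploits.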