Preemptive Switch Memory Usage to Accelerate Training Jobs with Shared In-Network Aggregation

Hao Wang, Yuxuan Qin, ChonLam Lao, Yanfang Le, Wenfei Wu, Kai Chen
DOI: https://doi.org/10.1109/icnp59255.2023.10355574
2023-01-01
Abstract: Recent works introduce In-Network Aggregation (INA) for distributed training (DT), which moves gradient summation into programmable network switches. INA can reduce traffic volume and accelerate communication in DT jobs. However, switch memory is a scarce resource that cannot support massive DT jobs in data centers, and existing INA solutions do not use switch memory to its fullest extent. We propose DSA, an Efficient Data-Plane switch memory Scheduler for in-network Aggregation. DSA introduces preemption into switch memory management for INA jobs. In the data plane, DSA allows high-priority gradient tensors to preempt switch aggregators (the basic computation unit in INA) from low-priority tensors, which prevents aggregators from sitting idle. In the control plane, DSA devises a priority policy that assigns high priority to gradient tensors that contribute more to overall job efficiency, e.g., those of communication-intensive jobs. We prototype DSA, and experiments show that it improves the average job completion time (JCT) by up to 1.35x compared with baseline solutions.
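To make the data-plane preemption idea concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not specify the switch logic, so everything here is an assumption for illustration: the names `Aggregator`, `AggregatorPool`, and `on_packet`, the mapping of chunks to slots by `chunk_id % num_slots`, the rule that equal-priority packets do not preempt, and the handling of evicted partial sums and bypassed packets (which a real system would presumably hand to some fallback aggregation path).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Aggregator:
    """One hypothetical switch aggregator slot holding a partial gradient sum."""
    key: Optional[Tuple[str, int]] = None  # (job_id, chunk_id) occupying the slot
    priority: int = 0
    total: float = 0.0
    seen: int = 0  # number of worker contributions summed so far


class AggregatorPool:
    """Toy model of a fixed pool of aggregators with priority preemption."""

    def __init__(self, num_slots: int, workers_per_job: int):
        self.slots = [Aggregator() for _ in range(num_slots)]
        self.workers_per_job = workers_per_job
        self.evictions = []  # partial sums pushed out by preemption

    def on_packet(self, job_id: str, chunk_id: int, priority: int, value: float):
        """Handle one gradient packet and return (action, payload)."""
        key = (job_id, chunk_id)
        # Assumed slot mapping; a real switch would likely hash packet headers.
        slot = self.slots[chunk_id % len(self.slots)]

        if slot.key is not None and slot.key != key:
            if priority <= slot.priority:
                # Occupant wins: forward this packet without aggregating it.
                return ("bypass", value)
            # Incoming tensor has higher priority: evict the occupant's partial sum.
            self.evictions.append((slot.key, slot.total, slot.seen))
            slot.key = None

        if slot.key is None:
            # Claim the free (or just-freed) slot for this tensor chunk.
            slot.key, slot.priority = key, priority
            slot.total, slot.seen = 0.0, 0

        slot.total += value
        slot.seen += 1

        if slot.seen == self.workers_per_job:
            # All workers contributed: emit the sum and release the slot.
            result = slot.total
            slot.key = None
            return ("complete", result)
        return ("aggregating", None)
```

With a single slot and two workers per job, a low-priority packet claims the slot, a later high-priority packet evicts it (the partial sum appears in `pool.evictions`), subsequent low-priority packets are bypassed, and the high-priority chunk completes once its second contribution arrives. The sketch covers only the data-plane side; the control-plane policy that assigns priorities is left out because the abstract gives no formula for it.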