Identifying Performance Bottleneck in Shared In-Network Aggregation During Distributed Training

Chang Liu, Jiaqi Zheng, Wenfei Wu, Bohan Zhao, Wei Nie, Xiaoyu Yi, Guihai Chen
DOI: https://doi.org/10.1109/icpads60453.2023.00015
2023-01-01
Abstract: With the emergence of large language models, distributed training (DT) improves performance through different parallelization strategies, resource schedulers, and advanced compression techniques. Meanwhile, In-Network Aggregation (INA), a promising acceleration primitive, offloads gradient aggregation to programmable switches, using switch memory to further reduce communication overhead. However, to the best of our knowledge, identifying the performance bottleneck in real time remains challenging. In this paper, we build Argus, a performance bottleneck monitoring framework for INA. Argus implements an aggregation digest extraction mechanism for real-time monitoring of DT jobs in multi-tenant, multi-rack clusters. Argus models the aggregation process to identify its performance bottlenecks, which assists the scheduler in deciding resource allocation. A prototype implementation and extensive evaluation show that Argus provides real-time, packet-level aggregation monitoring for identifying bottlenecks in INA with minimal performance overhead.