Scalable Tracing of MPI Events and Performance Metrics

Tao Yan, Qingguo Xu, Jiyu Luo, Jingwei Sun, Guangzhong Sun
DOI: https://doi.org/10.1109/IPDPSW59300.2023.00123
2023-01-01
Abstract: Tracing is a basic approach to analyzing performance and understanding the behavior patterns of MPI programs. However, MPI event traces require increasingly large storage space as the parallel scale grows. Beyond the event trace itself, many performance analysis tasks (e.g., performance variance detection, proxy synthesis) also require detailed runtime performance metrics, which further aggravates the storage issue. In this paper, we propose a scalable tracing tool that records and compresses MPI event traces and the related runtime performance metrics. The tool analyzes the data redundancy caused by loops and by the SPMD (single program, multiple data) property of MPI programs, and uses this analysis to reorganize and store the data compactly. Compared with existing trace compression methods, our tool generally achieves a higher compression ratio at a lower time cost.
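To make the two sources of redundancy named in the abstract concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a trace is simply a list of MPI call names per rank; the helpers `fold_loops` and `merge_ranks` and the `max_period` parameter are hypothetical names introduced for illustration. Loop redundancy is modeled as immediately repeated subsequences, and SPMD redundancy as ranks whose folded traces are identical.

```python
from collections import defaultdict


def fold_loops(events, max_period=8):
    """Collapse immediately repeated subsequences into (pattern, count) pairs.

    Hypothetical helper: a greedy stand-in for loop detection, not the
    method used by the tool described in the abstract.
    """
    folded = []
    i = 0
    n = len(events)
    while i < n:
        best_period, best_count = 1, 1
        for period in range(1, max_period + 1):
            pattern = events[i:i + period]
            if len(pattern) < period:
                break  # not enough events left for this period
            count = 1
            while events[i + count * period:i + (count + 1) * period] == pattern:
                count += 1
            # Prefer the candidate that covers the most events.
            if count > 1 and count * period > best_count * best_period:
                best_period, best_count = period, count
        folded.append((tuple(events[i:i + best_period]), best_count))
        i += best_period * best_count
    return folded


def merge_ranks(traces):
    """Group ranks whose folded traces are identical (SPMD redundancy)."""
    groups = defaultdict(list)
    for rank, events in traces.items():
        groups[tuple(fold_loops(events))].append(rank)
    return dict(groups)


# Assumed toy input: ranks 0 and 1 run the same loop, rank 2 differs.
loop_trace = ["MPI_Init"] + ["MPI_Send", "MPI_Recv"] * 3 + ["MPI_Finalize"]
traces = {
    0: loop_trace,
    1: list(loop_trace),
    2: ["MPI_Init", "MPI_Bcast", "MPI_Finalize"],
}
groups = merge_ranks(traces)
```

In this toy input, the eight events of `loop_trace` fold into three entries, and the three ranks reduce to two stored traces. A real tool would also need to handle call parameters, nested loops, and the per-event performance metrics, none of which this sketch attempts.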