Leveraging Graph Analysis to Pinpoint Root Causes of Scalability Issues for Parallel Applications

Yuyang Jin, Haojie Wang, Xiongchao Tang, Zhenhua Guo, Yaqian Zhao, Torsten Hoefler, Tao Liu, Xu Liu, Jidong Zhai
DOI: https://doi.org/10.1109/tpds.2024.3485789
IF: 5.3
2024-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: It is challenging to scale parallel applications to modern supercomputers because of load imbalance, resource contention, and communications between processes. Profiling and tracing are the two main performance analysis approaches for detecting these scalability bottlenecks. Profiling is low-cost but lacks the detailed dependence information needed to identify root causes. Tracing records plentiful information but incurs significant overhead. To address these issues, we present ScalAna, which employs static analysis techniques to combine the benefits of profiling and tracing: it provides tracing's analyzability with overhead similar to profiling. ScalAna uses static analysis to capture the program structure and data dependence of parallel applications, and leverages lightweight profiling approaches to record performance data at runtime. A parallel performance graph is then generated from both the static and dynamic data. Based on this graph, we design a backtracking detection approach to automatically pinpoint the root causes of scaling issues. We evaluate the efficacy and efficiency of ScalAna using several real applications with up to 704K lines of code and demonstrate that our approach can effectively pinpoint the root causes of scaling loss with an average overhead of 5.65% for up to 16,384 processes. Fixing the root causes detected by our tool yields up to 33.01% performance improvement.
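To make the idea of backtracking over a performance graph concrete, here is a minimal sketch. It is not ScalAna's implementation: the `Vertex` fields, the `scaling_loss` ratio, the greedy `backtrack` rule, and the three-vertex example graph are all hypothetical, chosen only to illustrate following dependence edges backward from a poorly scaling vertex toward a likely root cause.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """One node of a toy performance graph (hypothetical structure)."""
    name: str
    time_small: float  # measured time at a small process count
    time_large: float  # measured time at a large process count
    deps: list = field(default_factory=list)  # names of vertices this one depends on


def scaling_loss(v):
    # Assumed metric: a ratio above 1.0 means the vertex got slower at scale.
    return v.time_large / v.time_small


def backtrack(graph, start, threshold=1.0):
    """Walk dependence edges backward from `start`.

    At each step, move to the unvisited predecessor with the worst scaling
    loss, and stop when no predecessor exceeds `threshold`. The last vertex
    on the returned path is the candidate root cause.
    """
    path = [start]
    seen = {start}
    current = start
    while True:
        candidates = [
            d for d in graph[current].deps
            if d not in seen and scaling_loss(graph[d]) > threshold
        ]
        if not candidates:
            return path
        current = max(candidates, key=lambda d: scaling_loss(graph[d]))
        seen.add(current)
        path.append(current)


# Illustrative data only: a collective that waits on an imbalanced loop.
graph = {
    "MPI_Allreduce": Vertex("MPI_Allreduce", 1.0, 4.0, ["compute_loop"]),
    "compute_loop": Vertex("compute_loop", 2.0, 5.0, ["init"]),
    "init": Vertex("init", 1.0, 0.5, []),
}

path = backtrack(graph, "MPI_Allreduce")
root_cause = path[-1]
```

In this toy graph the walk starts at the slow collective, moves to `compute_loop` because its time grows at scale, and stops there because `init` does not degrade. A real tool would work on per-process data and richer dependence information than this single ratio.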