Detecting Performance Variance for Parallel Applications Without Source Code

Jidong Zhai, Liyan Zheng, Feng Zhang, Xiongchao Tang, Haojie Wang, Teng Yu, Yuyang Jin, Shuaiwen Leon Song, Wenguang Chen
DOI: https://doi.org/10.1109/tpds.2022.3181799
IF: 5.3
2022-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: For parallel applications, performance variance is a critical issue that can degrade performance and make applications’ behavior difficult to explain. Therefore, users and application developers should be able to detect and diagnose performance variance. Previous detection methods either introduce too much overhead and slow down applications, or rely on nontrivial source code analysis, which is impractical for production-run parallel systems. In this article, we propose Vapro, a framework for detecting and diagnosing performance variance in production-run parallel systems. Our method is based on the observation that most parallel programs contain code snippets that are executed repeatedly with a fixed workload and can be used to detect performance variance. We present the State Transition Graph (STG) to track program execution and then perform lightweight workload analysis on the STG to locate performance variance. Vapro is able to identify these snippets at runtime even without program source code. To diagnose the detected variance, Vapro uses a progressive diagnosis method based on a hybrid model combining variance breakdown and statistical analysis. According to the evaluation results, Vapro's performance overhead is only 1.38% on average. Vapro can identify performance variance in real applications caused by hardware issues, such as memory and IO. The standard deviation of the execution time decreases by up to 73.5% once the identified variance is fixed. Vapro achieves 30.0% larger detection coverage than the state-of-the-art variance detection approach based on source code analysis.
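The core observation, that repeated executions of a snippet with a fixed workload should take about the same time, can be illustrated with a minimal Python sketch. This is not Vapro's implementation: the record layout, the function name `find_variance`, and the slowdown threshold of 1.2 are assumptions made for illustration only, and the abstract does not specify how workloads are measured or compared.

```python
from collections import defaultdict


def find_variance(records, threshold=1.2):
    """Flag snippet executions that are slow relative to their peers.

    records: iterable of (snippet_id, workload, elapsed_seconds) tuples.
             This layout is hypothetical; it stands in for whatever a
             runtime tracer would collect.
    threshold: assumed slowdown factor above which an execution is flagged.

    Returns a list of (snippet_id, workload, elapsed, slowdown) tuples.
    """
    # Group timings of executions that share a snippet and a workload.
    groups = defaultdict(list)
    for snippet_id, workload, elapsed in records:
        groups[(snippet_id, workload)].append(elapsed)

    flagged = []
    for (snippet_id, workload), times in groups.items():
        if len(times) < 2:
            continue  # a single execution gives no baseline to compare with
        best = min(times)
        if best <= 0:
            continue  # skip degenerate timings
        for elapsed in times:
            slowdown = elapsed / best
            if slowdown > threshold:
                flagged.append((snippet_id, workload, elapsed, slowdown))
    return flagged


if __name__ == "__main__":
    sample = [
        ("halo_exchange", 100, 1.0),
        ("halo_exchange", 100, 1.05),
        ("halo_exchange", 100, 2.0),  # same workload, twice as slow
        ("reduce", 50, 3.0),          # executed once, so never flagged
    ]
    for item in find_variance(sample):
        print(item)
```

Using the fastest execution in each group as the baseline is one simple choice; a median or a statistical test would be a reasonable alternative when timings are noisy.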