GVARP: Detecting Performance Variance on Large-Scale Heterogeneous Systems

Xin You, Zhibo Xuan, Hailong Yang, Zhongzhi Luan, Yi Liu, Depei Qian
DOI: https://doi.org/10.1109/sc41406.2024.00063
2024-01-01
Abstract: Performance variance is one of the nasty pitfalls of large-scale heterogeneous systems, which can lead to unexpected and unpredictable performance degradation for parallel programs. Such performance issues typically arise from various random hardware and software faults, making it exceedingly difficult to pinpoint the exact causes of performance variance in specific instances. In this paper, we propose GVARP, a performance variance detection tool for large-scale heterogeneous systems. GVARP employs static analysis to identify the performance-critical parameters of kernel functions. Additionally, GVARP segments the program execution with external library calls and asynchronous kernel operations. Then GVARP constructs a state transfer graph and estimates the workload of each program segment to identify and cluster instances of similar workloads, facilitating the detection of performance variance. Our evaluation results demonstrate that GVARP effectively detects performance variance at a large scale with acceptable overhead and provides intuitive insights to locate the sources of performance variance.
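The idea of clustering segment instances by similar workload and then comparing their timings can be illustrated with a minimal sketch. This is not GVARP's implementation: the function name `detect_variance`, the tuple layout `(segment_id, workload_key, duration)`, the median baseline, and the 1.2x threshold are all assumptions chosen for illustration. Here `workload_key` stands in for the performance-critical parameters of a kernel (for example, a problem size), which the abstract says GVARP obtains through static analysis.

```python
from collections import defaultdict
from statistics import median


def detect_variance(instances, threshold=1.2):
    """Flag segment instances that run much slower than peers with the same workload.

    instances: list of (segment_id, workload_key, duration_seconds) tuples.
    threshold: hypothetical slowdown factor over the group median that counts
               as variance (an assumption, not a value from the paper).
    """
    # Group durations by segment and by workload, so that only instances
    # doing comparable work are compared against each other.
    groups = defaultdict(list)
    for segment_id, workload_key, duration in instances:
        groups[(segment_id, workload_key)].append(duration)

    flagged = []
    for segment_id, workload_key, duration in instances:
        baseline = median(groups[(segment_id, workload_key)])
        # An instance is suspicious when it is noticeably slower than the
        # typical instance in its own workload group.
        if baseline > 0 and duration > threshold * baseline:
            flagged.append((segment_id, workload_key, duration, duration / baseline))
    return flagged


if __name__ == "__main__":
    samples = [
        ("k1", (1024,), 1.0),
        ("k1", (1024,), 1.1),
        ("k1", (1024,), 1.0),
        ("k1", (1024,), 2.5),  # same workload, much slower
        ("k1", (2048,), 4.0),  # larger workload, slower by design
        ("k1", (2048,), 4.1),
    ]
    for seg, key, dur, slowdown in detect_variance(samples):
        print(f"segment={seg} workload={key} duration={dur}s slowdown={slowdown:.2f}x")
```

Grouping by workload before comparing matters because a segment that is slow only because it was given more work is not a variance instance; in the sample data the `(2048,)` runs take longer than the `(1024,)` runs yet are not flagged.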