Using Hardware Counter-Based Performance Model to Diagnose Scaling Issues of HPC Applications.

Nan Ding, Shiming Xu, Zhenya Song, Baoquan Zhang, Jingmei Li, Zhigao Zheng
DOI: https://doi.org/10.1007/s00521-018-3496-z
2018-01-01
Neural Computing and Applications
Abstract: Diagnosing the performance of HPC applications can be extremely difficult because of their complicated performance behaviors. On the one hand, developers typically identify potential performance bottlenecks through detailed instrumentation, which may introduce significant performance overhead or even perturb the measured performance. On the other hand, limits on computing resources and time allow developers only a small number of profiling runs. Meanwhile, the performance bottlenecks of HPC applications may vary with the degree of parallelism. To address these challenges, our paper proposes a systematic performance diagnosing method focused on building an accurate and interpretable performance model from hardware performance counters. Our method diagnoses the scaling issues of an HPC application by predicting its runtime and performance behaviors in different functions. After applying this modeling method to three real-world HPC applications, HOMME, CICE and OpenFOAM, our evaluations show that the model-based diagnosing method can diagnose potential scaling issues that are typically missed by traditional performance diagnosing methods, and that it achieves prediction errors of about 10% at a scale of 4096 MPI ranks on two problem sizes.