A tool for top-down performance analysis of GPU-accelerated applications

Keren Zhou, Mark Krentel, John Mellor-Crummey
DOI: https://doi.org/10.1145/3332466.3374534
2020-02-19
Abstract: To support performance measurement and analysis of GPU-accelerated applications, we extended the HPCToolkit performance tools with several novel features. To support efficient monitoring of accelerated applications, HPCToolkit employs a new wait-free data structure to coordinate measurement and attribution between each application thread and a GPU monitor thread. To help developers understand the performance of accelerated applications, HPCToolkit attributes metrics to heterogeneous calling contexts that span both CPUs and GPUs. To support fine-grain analysis and tuning of GPU-accelerated code, HPCToolkit collects PC samples of both CPU and GPU activity to derive and attribute metrics at all levels in a heterogeneous calling context.
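The abstract does not describe the layout of the wait-free data structure, so the following is only a minimal sketch of one common wait-free design: a bounded single-producer/single-consumer ring buffer, with an application thread as the producer and the GPU monitor thread as the consumer. The names `LaunchRecord`, `SpscChannel`, `correlation_id`, and `cpu_context_id` are hypothetical and are not taken from HPCToolkit.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical record an application thread hands to the monitor thread.
struct LaunchRecord {
    std::uint64_t correlation_id;  // assumed: links a launch to its GPU activity
    std::uint32_t cpu_context_id;  // assumed: handle for the CPU calling context
};

// Bounded single-producer/single-consumer queue. Each operation finishes in a
// bounded number of steps (no locks, no retry loops), so it is wait-free as
// long as exactly one thread pushes and exactly one thread pops.
template <typename T, std::size_t Capacity>
class SpscChannel {
    static_assert(Capacity >= 2, "one slot stays empty to tell full from empty");

public:
    // Producer side (application thread). Returns false if the queue is full.
    bool try_push(const T& item) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) return false;  // full
        slots_[tail] = item;
        tail_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }

    // Consumer side (monitor thread). Returns nullopt if the queue is empty.
    std::optional<T> try_pop() {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return std::nullopt;
        T item = slots_[head];
        head_.store((head + 1) % Capacity, std::memory_order_release);  // free the slot
        return item;
    }

private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to read; written only by the consumer
    std::atomic<std::size_t> tail_{0};  // next slot to write; written only by the producer
};

// Single-threaded walk-through; in real use push and pop run on different threads.
inline void channel_walkthrough() {
    SpscChannel<LaunchRecord, 4> channel;        // holds at most 3 records
    assert(channel.try_push({1, 10}));           // application thread records a launch
    const auto record = channel.try_pop();       // monitor thread picks it up
    assert(record && record->correlation_id == 1);
    assert(!channel.try_pop().has_value());      // queue is empty again
}
```

In this sketch each application thread would own one such channel, so neither side ever blocks the other; a full queue is reported to the caller rather than waited on.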
What problem does this paper attempt to address?
This paper addresses the difficulty of measuring and analyzing the performance of GPU-accelerated applications. Developing applications that exploit GPUs is more complex than programming traditional processors, and a poorly designed application can leave GPU capability under-utilized. Higher-level programming models such as RAJA, Kokkos, and OpenMP reduce the programming burden and improve portability, but they also make kernel performance harder to optimize because they hide key details from developers, such as the generated GPU code and how it executes.

To address these problems, the paper extends the HPCToolkit performance tools with new features for monitoring GPU-accelerated applications efficiently, understanding their performance, and analyzing and tuning them at fine granularity:

1. **Efficient monitoring mechanism**: A new wait-free data structure coordinates measurement and attribution between each application thread and the GPU monitor thread.
2. **Heterogeneous calling context attribution**: Metrics are attributed to heterogeneous calling contexts that span both CPUs and GPUs, helping developers understand the performance of accelerated applications.
3. **Fine-grained analysis and tuning**: Program counter (PC) samples of both CPU and GPU activity are collected to derive and attribute metrics at all levels of a heterogeneous calling context.

With these extensions, HPCToolkit can provide detailed insight into the performance bottlenecks of GPU-accelerated applications and so guide optimization work. The paper also demonstrates the tool's effectiveness through three case studies, which cover performance analysis of large-scale applications and individual kernels.
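The attribution described in items 2 and 3 can be illustrated with a small calling context tree. This is a sketch under stated assumptions, not HPCToolkit's implementation: frames are plain strings, a path runs from CPU frames through a GPU kernel down to a GPU instruction offset, and the names `ContextNode`, `attribute`, and `inclusive_samples` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <vector>

// One node of a calling context tree. The sample count is inclusive: it covers
// this frame and everything called (or launched) beneath it.
struct ContextNode {
    std::uint64_t samples = 0;
    std::map<std::string, std::unique_ptr<ContextNode>> children;
};

// Credits `count` samples to the root and to every node along `path`,
// creating nodes as needed.
inline void attribute(ContextNode& root,
                      const std::vector<std::string>& path,
                      std::uint64_t count) {
    ContextNode* node = &root;
    node->samples += count;
    for (const auto& frame : path) {
        auto& child = node->children[frame];
        if (!child) child = std::make_unique<ContextNode>();
        node = child.get();
        node->samples += count;
    }
}

// Returns the inclusive sample count at `path`, or 0 if the path is absent.
inline std::uint64_t inclusive_samples(const ContextNode& root,
                                       const std::vector<std::string>& path) {
    const ContextNode* node = &root;
    for (const auto& frame : path) {
        const auto it = node->children.find(frame);
        if (it == node->children.end()) return 0;
        node = it->second.get();
    }
    return node->samples;
}
```

For example, attributing 5 samples to `{"main", "solve", "[gpu] kernel_a", "pc+0x40"}` and 3 samples to the same path ending in `"pc+0x80"` leaves 8 inclusive samples on `kernel_a`, on `solve`, and on `main`. The same metric can then be read at the instruction, kernel, or CPU caller level.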