Abstract: To support performance measurement and analysis of GPU-accelerated applications, we extended the HPCToolkit performance tools with several novel features. To support efficient monitoring of accelerated applications, HPCToolkit employs a new wait-free data structure to coordinate measurement and attribution between each application thread and a GPU monitor thread. To help developers understand the performance of accelerated applications, HPCToolkit attributes metrics to heterogeneous calling contexts that span both CPUs and GPUs. To support fine-grain analysis and tuning of GPU-accelerated code, HPCToolkit collects PC samples of both CPU and GPU activity to derive and attribute metrics at all levels in a heterogeneous calling context.

What problem does this paper attempt to address?

This paper addresses the difficulty of measuring and analyzing the performance of GPU-accelerated applications. Developing applications that exploit GPUs is more complex than programming traditional processors, and a design that is not careful can leave the GPU under-utilized. Higher-level programming models (such as RAJA, Kokkos, and OpenMP) reduce the programming burden and improve portability, but they also make kernel performance harder to optimize, because they hide key details from developers, such as the generated GPU code and how it executes.

To address these problems, the paper extends the HPCToolkit performance tools with several new features for monitoring GPU-accelerated applications efficiently, understanding their performance, and conducting fine-grained analysis and tuning:

1. **Efficient monitoring mechanism**: A new wait-free data structure coordinates measurement and attribution between each application thread and the GPU monitor thread.
2. **Heterogeneous calling context attribution**: Metrics are attributed to heterogeneous calling contexts that span both CPUs and GPUs, which helps developers understand the performance of accelerated applications.
3. **Fine-grained analysis and tuning**: Program counter (PC) samples of both CPU and GPU activity are collected to derive and attribute metrics at all levels of a heterogeneous calling context.

With these extensions, HPCToolkit can provide detailed insight into the performance bottlenecks of GPU-accelerated applications and so guide optimization work. The paper also demonstrates the effectiveness of the tool through three case studies, which cover performance analysis of large-scale applications and of individual kernels.
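As a rough illustration of what attributing metrics "at all levels" of a heterogeneous calling context can mean, the sketch below builds a small calling context tree whose paths start with CPU frames and continue into GPU frames. It is not HPCToolkit's implementation or API: the `CCTNode` class, the `attribute` helper, the frame names, and the sample counts are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A frame is (device, function name); both values are illustrative labels.
Frame = Tuple[str, str]


@dataclass
class CCTNode:
    """One node of a hypothetical calling context tree."""
    frame: Frame
    exclusive: int = 0  # samples recorded directly in this frame
    children: Dict[Frame, "CCTNode"] = field(default_factory=dict)

    def child(self, frame: Frame) -> "CCTNode":
        # Reuse the existing child for this frame, or create it on first use.
        if frame not in self.children:
            self.children[frame] = CCTNode(frame)
        return self.children[frame]

    def inclusive(self) -> int:
        # Inclusive cost = own samples plus everything below this node.
        return self.exclusive + sum(c.inclusive() for c in self.children.values())


def attribute(root: CCTNode, cpu_path: List[str], gpu_path: List[str],
              samples: int = 1) -> CCTNode:
    """Record samples at the end of a path of CPU frames followed by GPU frames."""
    node = root
    for name in cpu_path:
        node = node.child(("cpu", name))
    for name in gpu_path:
        node = node.child(("gpu", name))
    node.exclusive += samples
    return node


# Hypothetical sample data: CPU-side launch path plus GPU-side frames.
root = CCTNode(("cpu", "<root>"))
launch_path = ["main", "solve", "cudaLaunchKernel"]
attribute(root, launch_path, ["stencil_kernel", "load_tile"], samples=30)
attribute(root, launch_path, ["stencil_kernel"], samples=10)
attribute(root, ["main", "solve"], [], samples=5)

solve = root.children[("cpu", "main")].children[("cpu", "solve")]
launch = solve.children[("cpu", "cudaLaunchKernel")]
kernel = launch.children[("gpu", "stencil_kernel")]
print(solve.inclusive(), launch.inclusive(), kernel.exclusive)
```

Because every node reports an inclusive total, GPU samples recorded deep inside a kernel also show up at the CPU frames that launched it, which is the top-down view the summary describes.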

A tool for top-down performance analysis of GPU-accelerated applications
