Abstract: GPU-based HPC clusters are attracting more scientific application developers because of their extensive parallelism and energy efficiency. To achieve portability across a variety of multi-core and many-core architectures, application developers often turn to directive-based parallel programming models such as OpenMP. Even with OpenMP, however, the developer must choose among many strategies for exploiting a GPU or a CPU. Recently, Machine Learning (ML) approaches have brought significant advances in the optimization of HPC applications, and several ways of representing application characteristics for ML models have been proposed. The available techniques, however, fail to capture features that are crucial for exposing parallelism. In this paper, we introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree (AST) to represent control and data flow information. The originality of this work lies in the addition of new edges that exploit the implicit ordering and parent-child relationships in ASTs, and in the introduction of edge weights that account for loop and condition information. We evaluate the proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region across CPUs and GPUs. The dataset is constructed by applying various transformations that use the collapse clause and data transfer between the CPU and GPU. The runtime predicted by the model is then used to determine which transformation provides the best performance. Results show that the approach is effective, with a normalized RMSE between 0.004 and 0.01 in its runtime predictions.
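To make the idea of an AST extended with ordering edges and loop-aware edge weights more concrete, the following is a minimal sketch in Python. It is not the paper's implementation: the `Node` class, the `build_edges` function, the edge-type names `child` and `next_sibling`, and the weighting rule (multiply by the loop trip count, halve under a condition) are all assumptions chosen for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical toy AST node; not taken from any real compiler front end."""
    kind: str                                   # e.g. "for", "if", "stmt"
    children: list = field(default_factory=list)
    trip_count: int = 1                         # only meaningful for "for" nodes


def build_edges(root):
    """Return (src, dst, edge_type, weight) tuples for a toy AST.

    Node ids are assigned in pre-order. Weights are an assumed proxy for
    how often the code below an edge executes.
    """
    edges = []
    counter = [0]

    def visit(node, weight):
        me = counter[0]
        counter[0] += 1
        if node.kind == "for":
            inner = weight * node.trip_count    # loop body runs trip_count times
        elif node.kind == "if":
            inner = weight * 0.5                # assumption: branch taken half the time
        else:
            inner = weight
        prev = None
        for child in node.children:
            c = visit(child, inner)
            edges.append((me, c, "child", inner))                # parent-child edge
            if prev is not None:
                edges.append((prev, c, "next_sibling", inner))   # implicit ordering
            prev = c
        return me

    visit(root, 1.0)
    return edges


# Toy region: for (4 iterations) { for (8 iterations) { stmt; stmt; } }
inner_loop = Node("for", [Node("stmt"), Node("stmt")], trip_count=8)
root = Node("for", [inner_loop], trip_count=4)
edges = build_edges(root)
for e in edges:
    print(e)
```

An edge list of this shape (source, destination, type, weight) is the kind of input a GNN library typically consumes. In this sketch a statement nested in two loops carries weight 4 × 8 = 32, which is one simple way loop information could be folded into edge weights.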

Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system

Hierarchical Roofline Analysis: How to Collect Data using Performance Tools on Intel CPUs and NVIDIA GPUs

8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks

An Empirical Roofline Model for Extreme-Scale I/O Workload Analysis

Applying the Roofline model for Deep Learning performance optimizations

Optimizing the Weather Research and Forecasting Model with OpenMP Offload and Codee

Realtime Ray Tracing on a Hybrid Parallel Architecture

A Hierarchical Grid Algorithm for Accelerating High-Performance Conjugate Gradient Benchmark on Sunway Many-Core Processor

A Comprehensive Methodology to Optimize FPGA Designs via the Roofline Model

ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels

A Performance Analysis Framework for Exploiting GPU Microarchitectural Capability

The Sparsity Roofline: Understanding the Hardware Limits of Sparse Neural Networks

Analyzing Resource Utilization in an HPC System: A Case Study of NERSC Perlmutter

Nek5000/RS Performance on Advanced GPU Architectures

A quantitative performance analysis model for GPU architectures

HipBone: A performance-portable GPU-accelerated C++ version of the NekBone benchmark

Optimizing High-Performance Linpack for Exascale Accelerated Architectures

Unleashing the Performance Potential of CPU-GPU Platforms for the 3D Atmospheric Euler Solver

Understanding GEMM Performance and Energy on NVIDIA Ada Lovelace: A Machine Learning-Based Analytical Approach

Optimizing the Performance of the Sparse Matrix-Vector Multiplication Kernel in FPGA Guided by the Roofline Model

Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster