Abstract: Nearly 20 years after the birth of general-purpose GPU computing, the HPC landscape is dominated by GPUs. After years of undisputed dominance by NVIDIA, new players have convincingly entered the arena, namely AMD and, more recently, Intel, whose devices currently power the first two clusters in the Top500 ranking. Code porting remains a major problem, all the more so now that several vendors are involved, but the emergence of simplified standard paradigms offers developers an encouraging prospect. In this work, we provide a detailed OpenMP porting strategy for STREAmS, a community code for compressible fluid dynamics. The proposed porting technique is based on the offload functionality of OpenMP 5.x and, in particular, on a hybrid directives/APIs approach that fits seamlessly into the multi-backend software ecosystem of STREAmS. We further carry out a comprehensive performance analysis on the Intel® Data Center GPU Max 1550 (formerly called Ponte Vecchio or PVC). In addition, we analyze the performance of the code on two benchmark clusters powered by PVC, including the exascale Aurora cluster. Performance is evaluated at each level of parallelism involved: the intrinsic parallelism of the PVC tile, the inter-tile parallelism within a GPU, the parallelism between the GPUs within a node, and the parallelism between the nodes within the cluster. The analysis shows that, although the implementation complexity of the OpenMP porting is limited, some important guidelines must be followed to achieve satisfactory performance. The PVC GPU shows about 40% higher performance than the NVIDIA A100 or AMD MI250X GPUs, which, however, were released about 3 years earlier. Both intra-node and inter-node scalability show good results. Overall, the introduction of PVC into the GPU computing HPC landscape represents a positive step forward for the diversification and competitiveness of the sector.

OpenMP offload toward the exascale using Intel® GPU Max 1550: evaluation of STREAmS compressible solver

A Lightweight Approach to Performance Portability with targetDP

Taking GPU Programming Models to Task for Performance Portability

GPU Implementation of a Sophisticated Implicit Low-Order Finite Element Solver with FP21-32-64 Computation Using OpenACC

Software for Sparse Tensor Decomposition on Emerging Computing Architectures

Evaluating performance portability of five shared-memory programming models using a high-order unstructured CFD solver

CuPBoP: CUDA for Parallelized and Broad-range Processors

Runtime Support for Performance Portability on Heterogeneous Distributed Platforms

GPU Domain Specialization via Composable On-Package Architecture

Evaluation of Programming Models and Performance for Stencil Computation on Current GPU Architectures

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

Mamba: Portable Array-based Abstractions for Heterogeneous High-Performance Systems

Application of performance portability solutions for GPUs and many-core CPUs to track reconstruction kernels

Automatic translation of data parallel programs for heterogeneous parallelism through OpenMP offloading