Abstract: Heterogeneous nodes composed of a multicore CPU and accelerators are today's norm in high-performance computing (HPC) platforms due to their superior performance and energy efficiency. Tools such as OpenCL and hybrid combinations such as OpenMP plus OpenACC are used for developing portable parallel programs for such nodes. However, these tools have drawbacks: they lack compiler support for nested parallelism, performance portability, automatic heterogeneous workload distribution, user-friendly thread placement, and processor affinity, all of which are essential to the portable performance of hybrid programs executing on such nodes. In this paper, we propose OpenH, a novel programming model and library API for developing portable parallel programs on heterogeneous hybrid servers composed of a multicore CPU and one or more different types of accelerators. OpenH integrates Pthreads, OpenMP, and OpenACC seamlessly to facilitate the development of hybrid parallel programs. An OpenH hybrid parallel program starts as a single main thread, which creates a group of Pthreads called hosting Pthreads. Each hosting Pthread then leads the execution of a software component of the program, either an OpenMP multithreaded component running on the CPU cores or an OpenACC (or OpenMP) component running on one of the accelerators of the server. The OpenH library provides API functions that allow programmers to query the configuration of the execution environment and bind the hosting Pthreads (and hence the execution of components) of the program to the CPU cores of the hybrid server to get the best performance. We illustrate the OpenH programming model and library API using two hybrid parallel applications based on matrix multiplication and 2D fast Fourier transform for the most general case of a hybrid hyperthreaded server comprising computing devices. Finally, we demonstrate the practical performance and energy consumption of OpenH for the hybrid parallel matrix multiplication application on a server comprising an Intel Ice Lake multicore CPU and two Nvidia A40 GPUs.
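
The execution structure described above (a main thread creating hosting Pthreads, each bound to a CPU core and leading one software component) can be sketched in plain C as follows. This is only an illustrative sketch and does not use the OpenH library API, whose function names are not given here: `component_t`, `bind_self_to_core`, `hosting_thread`, `hybrid_sum`, and `MAX_COMPONENTS` are hypothetical names, the binding relies on the Linux/glibc extension `pthread_setaffinity_np`, the mapping of component i to core i is an assumption, and a simple OpenMP sum reduction stands in for both the CPU component and the accelerator component (which in a real program would be an OpenACC or OpenMP offload region).

```c
#define _GNU_SOURCE            /* exposes pthread_setaffinity_np on Linux/glibc */
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

#define MAX_COMPONENTS 16      /* hypothetical upper bound for this sketch */

/* Hypothetical descriptor of one software component; not an OpenH type. */
typedef struct {
    int core;            /* CPU core the hosting Pthread should be bound to */
    const double *data;  /* slice of the input owned by this component */
    size_t n;            /* number of elements in the slice */
    double result;       /* partial sum produced by the component */
} component_t;

/* Best-effort binding of the calling thread to one core (Linux only). */
static int bind_self_to_core(int core)
{
#ifdef CPU_SET
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
#else
    (void)core;
    return -1;           /* affinity extensions unavailable on this platform */
#endif
}

/* Hosting Pthread: binds itself, then leads the execution of its component. */
static void *hosting_thread(void *arg)
{
    component_t *c = (component_t *)arg;
    (void)bind_self_to_core(c->core);   /* failure is tolerated in this sketch */

    double sum = 0.0;
    long n = (long)c->n;
    /* Stand-in component: an OpenMP multithreaded reduction over the slice. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += c->data[i];

    c->result = sum;
    return NULL;
}

/* Main-thread logic: split the work, create hosting Pthreads, join, combine. */
int hybrid_sum(const double *data, size_t n, int num_components, double *total)
{
    if (!data || !total || num_components < 1 || num_components > MAX_COMPONENTS)
        return -1;

    pthread_t tid[MAX_COMPONENTS];
    component_t comp[MAX_COMPONENTS];
    size_t base = n / (size_t)num_components;
    size_t rem = n % (size_t)num_components;
    size_t offset = 0;
    int created = 0, status = 0;

    for (int i = 0; i < num_components; ++i) {
        comp[i].core = i;                       /* assumed mapping: component i -> core i */
        comp[i].data = data + offset;
        comp[i].n = base + ((size_t)i < rem ? 1 : 0);
        comp[i].result = 0.0;
        offset += comp[i].n;
        if (pthread_create(&tid[i], NULL, hosting_thread, &comp[i]) != 0) {
            status = -1;
            break;
        }
        ++created;
    }
    assert(status != 0 || offset == n);         /* the slices cover the whole input */

    double sum = 0.0;
    for (int i = 0; i < created; ++i) {
        pthread_join(tid[i], NULL);
        sum += comp[i].result;
    }
    if (status == 0)
        *total = sum;
    return status;
}
```

A caller would invoke, for example, `hybrid_sum(x, n, 3, &total)` to run three components on three hosting Pthreads. With GCC the sketch can be built with `-fopenmp -pthread`; without `-fopenmp` the pragma is ignored and each component runs sequentially inside its hosting Pthread.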
