Abstract: Heterogeneous nodes composed of a multicore CPU and accelerators are today's norm in high-performance computing (HPC) platforms due to their superior performance and energy efficiency. Tools such as OpenCL and hybrid combinations such as OpenMP plus OpenACC are used for developing portable parallel programs for such nodes. However, these tools have some drawbacks, including a lack of compiler support for nested parallelism, performance portability, automatic heterogeneous workload distribution, user-friendly thread placement, and processor affinity essential to the portable performance of hybrid programs executing on such nodes. In this paper, we propose OpenH, a novel programming model and library API for developing portable parallel programs on heterogeneous hybrid servers composed of a multicore CPU and one or more different types of accelerators. OpenH integrates Pthreads, OpenMP, and OpenACC seamlessly to facilitate the development of hybrid parallel programs. An OpenH hybrid parallel program starts as a single main thread, creating a group of Pthreads called hosting Pthreads. A hosting Pthread then leads the execution of a software component of the program, either an OpenMP multithreaded component running on the CPU cores or an OpenACC (or OpenMP) component running on one of the accelerators of the server. The OpenH library provides API functions that allow programmers to get the configuration of the executing environment and bind the hosting Pthreads (and hence the execution of components) of the program to the CPU cores of the hybrid server to get the best performance. We illustrate the OpenH programming model and library API using two hybrid parallel applications based on matrix multiplication and 2D fast Fourier transform for the most general case of a hybrid hyperthreaded server comprising computing devices. Finally, we demonstrate the practical performance and energy consumption of OpenH for the hybrid parallel matrix multiplication application on a server comprising an Intel Icelake multicore CPU and two Nvidia A40 GPUs.
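
The hosting-thread pattern described in the abstract can be illustrated with a minimal sketch. It is written in Python for brevity and is not the OpenH API, which targets C with Pthreads, OpenMP, and OpenACC. The names `partition_rows`, `hosting_thread`, and `run_hybrid_matmul` are hypothetical, the relative device speeds are assumed inputs, and each hosting thread computes its slice of the matrix product directly, where an OpenH hosting Pthread would instead lead an OpenMP or OpenACC component. Core binding is attempted only where the operating system exposes it (Linux).

```python
import os
import threading


def partition_rows(n_rows, speeds):
    """Split n_rows into contiguous (start, end) slices, one per device,
    with sizes proportional to the assumed relative speeds."""
    total = sum(speeds)
    bounds = [0]
    acc = 0
    for s in speeds[:-1]:
        acc += s
        bounds.append(round(n_rows * acc / total))
    bounds.append(n_rows)
    return [(bounds[i], bounds[i + 1]) for i in range(len(speeds))]


def hosting_thread(core, rows, a, b, c):
    """Stand-in for a hosting thread: bind to one core, then compute
    rows [lo, hi) of c = a * b. Here the thread does the work itself."""
    if hasattr(os, "sched_setaffinity"):
        try:
            # pid 0 refers to the calling thread on Linux.
            os.sched_setaffinity(0, {core})
        except OSError:
            pass  # binding is best-effort in this sketch
    lo, hi = rows
    inner = len(b)
    cols = len(b[0])
    for i in range(lo, hi):
        for j in range(cols):
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(inner))


def run_hybrid_matmul(a, b, speeds):
    """Main thread: query the available cores, create one hosting thread
    per device, hand each a row slice, and wait for all of them."""
    c = [[0] * len(b[0]) for _ in a]
    if hasattr(os, "sched_getaffinity"):
        cores = sorted(os.sched_getaffinity(0))
    else:
        cores = [0]
    slices = partition_rows(len(a), speeds)
    threads = [
        threading.Thread(
            target=hosting_thread,
            args=(cores[i % len(cores)], rows, a, b, c),
        )
        for i, rows in enumerate(slices)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return c
```

For example, `run_hybrid_matmul(a, b, [1, 3])` would give the second hosting thread roughly three times as many rows of `a` as the first. Static proportional partitioning is only one possible distribution strategy and is used here as an assumption, not as the method of the paper.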

Gemma in April: A matrix-like parallel programming architecture on OpenCL

A parallel computing method for irregular work

Heterogeneous Programming and Optimization of Gyrokinetic Toroidal Code and Large-Scale Performance Test on TH-1A

An Parallel and Distributed Programming Solution Based on Heterogeneous GPU Cluster

A Programming Framework Based on Multi-GPU

Experience of Parallelizing Cryo-EM 3D Reconstruction on a CPU-GPU Heterogeneous System

High Level Programming for Heterogeneous Architectures

A coordinated tiling and batching framework for efficient GEMM on GPUs

Programming Framework for Node Heterogeneous GPU Cluster

GPU First -- Execution of Legacy CPU Codes on GPUs

Parallelism for cryo-EM 3D reconstruction on CPU-GPU heterogeneous system

GAMMA: A Graph Pattern Mining Framework for Large Graphs on GPU

Optimising GPGPU Execution Through Runtime Micro-Architecture Parameter Analysis

MapCG: Writing Parallel Program Portable Between CPU and GPU

A GPU-based Graph Pattern Mining System

OpenH: A Novel Programming Model and API for Developing Portable Parallel Programs on Heterogeneous Hybrid Servers

FT-GEMM: A Fault Tolerant High Performance GEMM Implementation on x86 CPUs

Exploiting Parallelism in the Simulation of General Purpose Graphics Processing Unit Program

GPU Parallel Computing: Programming Language, Debugging Tools and Data Structures

PARRAY

Automatic Task Assignment System of General Computing Oriented GPU Cluster