Abstract: In the last few years, many scientific applications have been developed for powerful graphics processing units (GPUs) and have achieved remarkable speedups. This success can be partially attributed to high-performance, host-callable GPU library routines that are offloaded to GPUs at runtime. These library routines are based on C/C++-like programming toolkits such as CUDA from NVIDIA and have the same calling signatures as their CPU counterparts. Recently, now that CUDA provides sufficient support for C++ templates, the emergence of template libraries has enabled further advances in code reusability and rapid software development for GPUs. However, Expression Templates (ET), which have been very popular for implementing data-parallel scientific software for host CPUs because of their intuitive, mathematics-like syntax, have been underutilized by GPU development libraries. ETs are rarely used because offloading them from hosts to GPUs is difficult: instantiated expressions cannot be passed to GPU kernels, and the exact form of the expressions is not known to the template author at the time of coding. This paper presents a general approach that enables automatic offloading of C++ expression templates to CUDA-enabled GPUs. It uses C++ metaprogramming and Just-In-Time (JIT) compilation to generate and compile CUDA kernels for the corresponding expression templates, and then executes the kernels with the appropriate arguments. This approach allows developers to port applications to GPUs with virtually no code modifications. More specifically, this paper uses a large ET-based data-parallel physics library called QDP++ as an example to illustrate many aspects of offloading expression templates automatically, and to demonstrate very good speedups for typical QDP++ applications running on GPUs compared with running on CPUs. In addition, this approach to automatically offloading expression templates could be applied to other many-core accelerators that provide C++ programming toolkits with support for C++ templates.
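The code-generation step described above can be illustrated with a minimal sketch. It is not QDP++ code, and every name in it (`Expr`, `Field`, `BinaryExpr`, `kernel_source`, `example_source`) is hypothetical. It assumes that each leaf of an expression corresponds to a `double` array already resident on the device, and it only builds the CUDA source text for an element-wise assignment. Compiling that text at runtime and launching the kernel are left out.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// CRTP base so the operators below match only expression types.
template <class E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Leaf node: a named array assumed to live in device memory (hypothetical).
struct Field : Expr<Field> {
    std::string name;
    explicit Field(std::string n) : name(std::move(n)) {}
    // Text this leaf contributes to the kernel body.
    std::string code() const { return name + "[i]"; }
    // Record the array name once so it becomes a kernel parameter.
    void args(std::vector<std::string>& out) const {
        if (std::find(out.begin(), out.end(), name) == out.end()) out.push_back(name);
    }
};

// Operator tags carry the symbol to emit.
struct Add { static const char* symbol() { return "+"; } };
struct Mul { static const char* symbol() { return "*"; } };

// Interior node: operands are stored by value so the tree cannot dangle.
template <class Op, class L, class R>
struct BinaryExpr : Expr<BinaryExpr<Op, L, R>> {
    L lhs;
    R rhs;
    BinaryExpr(const L& l, const R& r) : lhs(l), rhs(r) {}
    std::string code() const {
        return "(" + lhs.code() + " " + Op::symbol() + " " + rhs.code() + ")";
    }
    void args(std::vector<std::string>& out) const {
        lhs.args(out);
        rhs.args(out);
    }
};

template <class L, class R>
BinaryExpr<Add, L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return BinaryExpr<Add, L, R>(l.self(), r.self());
}

template <class L, class R>
BinaryExpr<Mul, L, R> operator*(const Expr<L>& l, const Expr<R>& r) {
    return BinaryExpr<Mul, L, R>(l.self(), r.self());
}

// Walk the expression type and emit CUDA source for dest[i] = expr[i].
template <class E>
std::string kernel_source(const std::string& kernel_name, const Field& dest,
                          const Expr<E>& expr) {
    std::vector<std::string> names;
    expr.self().args(names);
    std::string src = "extern \"C\" __global__ void " + kernel_name +
                      "(double* " + dest.name;
    for (const auto& n : names) {
        // The destination is already a (non-const) parameter.
        if (n != dest.name) src += ", const double* " + n;
    }
    src += ", int n) {\n";
    src += "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n";
    src += "  if (i < n) " + dest.name + "[i] = " + expr.self().code() + ";\n";
    src += "}\n";
    return src;
}

// Usage: source text for d = a * b + c.
std::string example_source() {
    Field a("a"), b("b"), c("c"), d("d");
    return kernel_source("eval_expr", d, a * b + c);
}
```

In a full pipeline, the returned string would be handed to a runtime compiler and the resulting kernel launched with the device pointers behind each leaf as arguments. Which compiler and launch mechanism to use is an assumption left open here.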

Automatic Offloading C++ Expression Templates to CUDA Enabled GPUs.
