Abstract: In the last few years, many scientific applications have been developed for powerful graphics processing units (GPUs) and have achieved remarkable speedups. This success can be partially attributed to high-performance, host-callable GPU library routines that are offloaded to GPUs at runtime. These library routines are based on C/C++-like programming toolkits such as CUDA from NVIDIA and have the same calling signatures as their CPU counterparts. Recently, with sufficient support for C++ templates in CUDA, the emergence of template libraries has enabled further advances in code reusability and rapid software development for GPUs. However, expression templates (ETs), which have been very popular for implementing data-parallel scientific software for host CPUs because of their intuitive, mathematics-like syntax, have been underutilized by GPU development libraries. The lack of ET usage is caused by the difficulty of offloading expression templates from hosts to GPUs: instantiated expressions cannot be passed to GPU kernels, and the exact form of the expressions is not known when the templates are written. This paper presents a general approach that enables automatic offloading of C++ expression templates to CUDA-enabled GPUs. The approach uses C++ metaprogramming and just-in-time (JIT) compilation to generate and compile CUDA kernels for the corresponding expression templates, then executes the kernels with the appropriate arguments. It allows developers to port applications to run on GPUs with virtually no code modifications. More specifically, this paper uses a large ET-based data-parallel physics library called QDP++ as an example to illustrate many aspects of the approach and to demonstrate very good speedups for typical QDP++ applications running on GPUs compared with running on CPUs.
In addition, this approach of automatically offloading expression templates could be applied to other many-core accelerators that provide C++ programming toolkits with support for C++ templates.
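
To make the idea in the abstract concrete, the following is a minimal sketch, not the paper's or QDP++'s actual implementation. It shows how an expression template can carry enough type information to emit CUDA kernel source text for the expression it represents. All names here (`Expr`, `Field`, `BinaryExpr`, `kernel_source`, `evaluate_on_host`) are hypothetical and chosen for illustration only. The JIT compilation and kernel launch steps are deliberately left out; the sketch only generates the kernel string and provides a host-side reference evaluation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// CRTP base: marks a type as an expression so the operators below
// only apply to expression types (hypothetical design, for illustration).
template <class E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Leaf node: a named host array. Fields must outlive expressions that use them.
struct Field : Expr<Field> {
    std::string name;
    std::vector<double> data;
    Field(std::string n, std::vector<double> d) : name(std::move(n)), data(std::move(d)) {}

    double eval(std::size_t i) const { assert(i < data.size()); return data[i]; }
    // How this leaf appears inside the generated kernel body.
    std::string code() const { return name + "[i]"; }
    // Record this field as a kernel parameter, once per distinct name.
    void collect(std::vector<std::string>& names) const {
        if (std::find(names.begin(), names.end(), name) == names.end()) names.push_back(name);
    }
};

// Storage policy: inner nodes are stored by value, leaves by reference.
template <class T> struct Stored { using type = T; };
template <> struct Stored<Field> { using type = const Field&; };

struct Add { static double apply(double a, double b) { return a + b; } static const char* symbol() { return "+"; } };
struct Mul { static double apply(double a, double b) { return a * b; } static const char* symbol() { return "*"; } };

// Inner node: the operation is encoded in the type, not executed eagerly.
template <class Op, class L, class R>
struct BinaryExpr : Expr<BinaryExpr<Op, L, R>> {
    typename Stored<L>::type lhs;
    typename Stored<R>::type rhs;
    BinaryExpr(const L& l, const R& r) : lhs(l), rhs(r) {}

    double eval(std::size_t i) const { return Op::apply(lhs.eval(i), rhs.eval(i)); }
    std::string code() const { return "(" + lhs.code() + " " + Op::symbol() + " " + rhs.code() + ")"; }
    void collect(std::vector<std::string>& names) const { lhs.collect(names); rhs.collect(names); }
};

template <class L, class R>
BinaryExpr<Add, L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return BinaryExpr<Add, L, R>(l.self(), r.self());
}
template <class L, class R>
BinaryExpr<Mul, L, R> operator*(const Expr<L>& l, const Expr<R>& r) {
    return BinaryExpr<Mul, L, R>(l.self(), r.self());
}

// Walk the expression tree and emit CUDA kernel source as a string.
// Assumes field names do not clash with "out", "n", or "i".
template <class E>
std::string kernel_source(const std::string& kernel_name, const Expr<E>& expr) {
    std::vector<std::string> names;
    expr.self().collect(names);
    std::string src = "extern \"C\" __global__ void " + kernel_name + "(double* out";
    for (const auto& n : names) src += ", const double* " + n;
    src += ", int n) {\n"
           "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
           "  if (i < n) out[i] = " + expr.self().code() + ";\n"
           "}\n";
    return src;
}

// Host-side reference evaluation of the same expression tree.
template <class E>
std::vector<double> evaluate_on_host(const Expr<E>& expr, std::size_t n) {
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = expr.self().eval(i);
    return out;
}
```

Given fields named `a`, `b`, and `c`, the call `kernel_source("et_kernel", a * b + c)` yields a kernel whose body assigns `((a[i] * b[i]) + c[i])` to `out[i]`. In a full system along the lines the abstract describes, that string would then be compiled at runtime and launched with the device pointers of the fields; how that step is done is outside this sketch.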

Scalable, Fast Cloud Computing with Execution Templates

Execution Templates: Caching Control Plane Decisions for Strong Scaling of Data Analytics

NO2: Speeding Up Parallel Processing of Massive Compute-Intensive Tasks

Towards Optimizing Storage Costs on the Cloud

An Optimization Toolchain Design Of Deep Learning Deployment Based On Heterogeneous Computing Platform

SNC: A Cloud Service Platform for Symbolic-Numeric Computation Using Just-In-Time Compilation

CASH: A Credit Aware Scheduling for Public Cloud Platforms

An Efficient Trust-Aware Task Scheduling Algorithm in Cloud Computing Using Firefly Optimization

Julia Cloud Matrix Machine: Dynamic Matrix Language Acceleration on Multicore Clusters in the Cloud

Scalable, Distributed AI Frameworks: Leveraging Cloud Computing for Enhanced Deep Learning Performance and Efficiency

Cloud Services Enable Efficient AI-Guided Simulation Workflows across Heterogeneous Resources

Templating Shuffles

Canary: A Scheduling Architecture for High Performance Cloud Computing

A Lightweight Execution Framework for Massive Independent Tasks

Automatic Offloading C++ Expression Templates to CUDA Enabled GPUs.

Cloud Benchmarking For Maximising Performance of Scientific Applications

Reproducible Performance Optimization of Complex Applications on the Edge-to-Cloud Continuum

Swift: Reliable and Low-Latency Data Processing at Cloud Scale

Variations in Performance and Scalability When Migrating n-Tier Applications to Different Clouds

SLA Aware Optimized Task Scheduling Model for Faster Execution of Workloads Among Federated Clouds

Boosting Cloud Data Analytics using Multi-Objective Optimization