Abstract: Heterogeneous parallel computing applications often process large data sets that require multiple GPUs to jointly meet their needs for physical memory capacity and compute throughput. However, the lack of high-level abstractions in previous heterogeneous parallel programming models forces programmers to resort to multiple code versions, complex data copy steps, and synchronization schemes when exchanging data between multiple GPU devices, which results in high software development cost, poor maintainability, and even poor performance. This paper describes the HPE runtime system, and the associated architecture support, which enables a simple, efficient programming interface for exchanging data between multiple GPUs through either interconnects or cross-node network interfaces. The runtime and architecture support presented in this paper can also be used to support other types of accelerators. We show that the simplified programming interface reduces programming complexity. The research presented in this paper started in 2009. It has been implemented and tested extensively in several generations of HPE runtime systems, and has been adopted into the NVIDIA GPU hardware and drivers for CUDA 4.0 and beyond since 2011. The availability of real hardware that supports key HPE features gives rise to a rare opportunity for studying the effectiveness of the hardware support by running important benchmarks on a real runtime and real hardware. Experimental results show that in an exemplar heterogeneous system, peer DMA and double-buffering, pinned buffers, and software techniques can improve the inter-accelerator data communication bandwidth by 2×. They can also improve the execution speed by 1.6× for a 3D finite difference, 2.5× for 1D FFT, and 1.6× for merge sort, all measured on real hardware.
The proposed architecture support enables the HPE runtime to transparently deploy these optimizations under simple portable user code, allowing system designers to freely employ devices of different capabilities. We further argue that simple interfaces such as HPE are needed for most applications to benefit from advanced hardware features in practice.

Automatic execution of single-GPU computations across multiple GPUs

AEML: An Acceleration Engine for Multi-GPU Load-balancing in Distributed Heterogeneous Environment

AMOEBA: A Coarse Grained Reconfigurable Architecture for Dynamic GPU Scaling

MGPU-TSM: A Multi-GPU System with Truly Shared Memory

Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications

An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs

Efficient GPU Spatial-Temporal Multitasking

AmgX: A Library for GPU Accelerated Algebraic Multigrid and Preconditioned Iterative Methods

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

Multi-GPU acceleration of large-scale density-based topology optimization

MGSim + MGMark: A Framework for Multi-GPU System Research

Automatic BLAS Offloading on Unified Memory Architecture: A Study on NVIDIA Grace-Hopper

Extending DD-$α$AMG on heterogeneous machines

OpenACC offloading of the MFC compressible multiphase flow solver on AMD and NVIDIA GPUs

Efficient Resource Sharing Through GPU Virtualization on Accelerated High Performance Computing Systems

Improving Multi-Application Concurrency Support Within the GPU Memory System

GAMER: a GPU-Accelerated Adaptive Mesh Refinement Code for Astrophysics

Dynamic Load Balancing in GPU-Based Systems - Early Experiments

Runtime Support for Performance Portability on Heterogeneous Distributed Platforms

Batch-Aware Unified Memory Management in GPUs for Irregular Workloads

An adaptive finite element multigrid solver using GPU acceleration