Abstract: Modern computing platforms tend to deploy multiple GPUs (2, 4, or more) on a single node to boost system performance, with each GPU providing a large global memory and many streaming multiprocessors (SMs). GPUs are an expensive resource, and increasing their utilization without degrading the performance of individual workloads is an important and challenging problem. Although services such as MPS support simultaneous execution of multiple cooperative kernels on a single device, they do not solve this problem for uncooperative kernels, because MPS is oblivious to the resource needs of each kernel. We propose a fully automated, compiler-assisted scheduling framework. The compiler constructs GPU tasks by identifying kernel launches and their related GPU operations (e.g., memory allocations). For each GPU task, the compiler instruments a probe in the host-side code immediately before the task's launch point. At runtime, the probe conveys the task's resource requirements (e.g., memory and compute cores) to a scheduler, which places the task on an appropriate device based on those requirements and the devices' current load, in a memory-safe, resource-aware manner. To demonstrate the framework's advantages, we prototyped a throughput-oriented scheduler on top of it and evaluated it with the Rodinia benchmark suite and the Darknet neural network framework on NVIDIA GPUs. The results show that the proposed solution outperforms existing state-of-the-art solutions by leveraging its knowledge of applications' multiple resource requirements, which include memory as well as SMs. It improves throughput by up to 2.5x for Rodinia benchmarks and up to 2.7x for Darknet neural networks. In addition, it improves job turnaround time by up to 4.9x and limits individual kernel performance degradation to at most 2.5%.
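
The placement step described in the abstract can be illustrated with a small sketch. This is not the authors' scheduler: the names `Device`, `TaskRequest`, and `place_task` are hypothetical, and the policy shown (memory as a hard constraint, SM availability as a preference) is only one assumed reading of "memory-safe, resource-aware". Releasing resources when a task finishes is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Device:
    """Hypothetical scheduler-side view of one GPU's free resources."""
    dev_id: int
    free_mem_bytes: int
    free_sms: int


@dataclass
class TaskRequest:
    """Hypothetical probe payload: what a GPU task needs before launch."""
    mem_bytes: int
    sms: int


def place_task(devices: List[Device], req: TaskRequest) -> Optional[int]:
    """Return the id of the chosen device, or None if the task must wait.

    Assumed policy: never oversubscribe memory; among devices with enough
    free memory, prefer one that can also supply the requested SMs, and
    break ties by the largest number of free SMs.
    """
    # Memory safety: only consider devices whose free memory fits the task.
    fits = [d for d in devices if d.free_mem_bytes >= req.mem_bytes]
    if not fits:
        return None

    # Resource awareness: rank by (can satisfy the SM request, free SMs).
    best = max(fits, key=lambda d: (d.free_sms >= req.sms, d.free_sms))

    # Reserve the resources so later placements see the updated load.
    best.free_mem_bytes -= req.mem_bytes
    best.free_sms = max(0, best.free_sms - req.sms)
    return best.dev_id
```

In this sketch, a request that no device can hold in memory returns `None`, which a caller could treat as "queue the task until memory is released"; SM shortfalls only lower a device's rank rather than blocking placement.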

Weak Execution Ordering - Exploiting Iterative Methods on Many-Core GPUs.

Exploiting Parallelism in the Simulation of General Purpose Graphics Processing Unit Program

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs.

PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications

Improving Dense Linear Equation Solver on Hybrid CPU-GPU System.

An Expansion-Aided Synchronous Conservative Time Management Algorithm on GPU.

Towards Accelerating Irregular EDA Applications with GPUs.

A CPU-GPGPU Scheduler Based on Data Transmission Bandwidth of Workload

On Parallel Solution of Sparse Triangular Linear Systems in CUDA

Accelerating Dissipative Particle Dynamics Simulations on GPUs: Algorithms, Numerics and Applications

Effective GPU Sharing Under Compiler Guidance

A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels

GPU computing using concurrent kernels: A case study

An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization.

Rapid prototyping of image processing workflows on massively parallel architectures

Generalized GPU Acceleration for Applications Employing Finite-Volume Methods

Research and Implementation of Effective Jacobi Iteration Algorithms on GPU

A GPU Based Parallel Computing Mechanism of an Etching Profile Evolution Model.

Taming irregular EDA applications on GPUs.

Parallel singular value decomposition on heterogeneous multi-core and multi-GPU platforms

Characterizing the Execution Dynamics of GPGPU Applications