Abstract: The unrivaled computing capabilities of modern GPUs meet the demand for processing the massive amounts of data seen in many application domains. While traditional HPC systems support applications as standalone entities that occupy entire GPUs, GPU-based DBMSs are meant to run multiple tasks at the same time on the same device. System-level resource management mechanisms are therefore needed to fully unleash the computing power of GPUs in large-scale data processing, and some prior research has addressed this need. In our previous work, we explored the modeling of a single compute-bound kernel on GPUs under NVIDIA's CUDA framework and provided an in-depth anatomy of NVIDIA's concurrent kernel execution mechanism (CUDA streams). This paper focuses on resource allocation among multiple GPU applications to optimize system throughput in the context of such systems. Compared with earlier studies on enabling concurrent task support on GPUs, such as MultiQx-GPU, we take a different approach: as a kernel-level optimization, we control the launch parameters of multiple GPU kernels as provided by compile-time performance modeling, and we add a more general pre-processing model with batch-level control to enhance performance. Specifically, we construct a variation of the multi-dimensional knapsack model to maximize concurrency in a multi-kernel environment. We present an in-depth analysis of our model and develop an algorithm based on dynamic programming to solve it. We prove that the algorithm finds optimal solutions (in terms of thread concurrency) to the problem and has pseudo-polynomial complexity in both time and space. These results are verified by extensive experiments running on our microbenchmark, which consists of real-world GPU queries. Furthermore, solutions identified by our method also significantly reduce the total running time of the workload compared with sequential and MultiQx-GPU executions.
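To make the knapsack formulation more concrete, here is a minimal sketch of a two-dimensional 0/1 knapsack solved by dynamic programming. It is not the paper's actual model or algorithm: the choice of resources (registers and shared memory), the use of thread count as the value to maximize, and the names `max_concurrency`, `kernels`, and `capacity` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical kernel description: (threads, register_units, shared_mem_units).
Kernel = Tuple[int, int, int]


def max_concurrency(kernels: List[Kernel],
                    capacity: Tuple[int, int]) -> Tuple[int, Tuple[int, ...]]:
    """Pick a subset of kernels that maximizes total threads without
    exceeding the (assumed) register and shared-memory capacities."""
    reg_cap, smem_cap = capacity
    # best[r][s]: max total threads using at most r register units
    # and s shared-memory units.
    best = [[0] * (smem_cap + 1) for _ in range(reg_cap + 1)]
    # chosen[r][s]: indices of the kernels that achieve best[r][s].
    chosen = [[() for _ in range(smem_cap + 1)] for _ in range(reg_cap + 1)]

    for i, (threads, regs, smem) in enumerate(kernels):
        # Iterate capacities downward so each kernel is used at most once.
        for r in range(reg_cap, regs - 1, -1):
            for s in range(smem_cap, smem - 1, -1):
                candidate = best[r - regs][s - smem] + threads
                if candidate > best[r][s]:
                    best[r][s] = candidate
                    chosen[r][s] = chosen[r - regs][s - smem] + (i,)

    return best[reg_cap][smem_cap], chosen[reg_cap][smem_cap]


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    demo_kernels = [(256, 4, 2), (512, 6, 3), (128, 2, 1), (384, 5, 4)]
    total, picked = max_concurrency(demo_kernels, capacity=(10, 6))
    print(total, picked)
```

The table has one cell per combination of capacities, so time and space grow with the number of kernels times the product of the capacity values. That dependence on the numeric size of the capacities, rather than on the length of the input, is what "pseudo-polynomial" refers to.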

Simultaneous Multikernel: Fine-Grained Sharing of GPUs.

Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing

POSTER: Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls

Effective GPU Sharing Under Compiler Guidance

A Virtual Multi-Channel GPU Fair Scheduling Method for Virtual Machines.

Quality of Service Support for Fine-Grained Sharing on GPUs.

MGPU-TSM: A Multi-GPU System with Truly Shared Memory

Improving GPU Performance Through Resource Sharing

Efficient GPU Spatial-Temporal Multitasking

Efficient Kernel Management on GPUs.

Efficient Resource Sharing Through GPU Virtualization on Accelerated High Performance Computing Systems

Preemption-Aware Kernel Scheduling for GPUs

Two-Stage Modeling and Control of Concurrent Tasks in a Multi-Kernel GPGPU Environment

Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling

Fair and Cache Blocking Aware Warp Scheduling for Concurrent Kernel Execution on GPU

FIKIT: Priority-Based Real-time GPU Multi-tasking Scheduling with Kernel Identification

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs.

Improving Multi-Application Concurrency Support Within the GPU Memory System

Concurrent query processing in a GPU-based database system

HeteroCore GPU to Exploit TLP-Resource Diversity

Cooperative Kernels: GPU Multitasking for Blocking Algorithms (Extended Version)