Abstract: With advances in technology scaling, graphics processing units (GPUs) incorporate an ever-increasing amount of computing resources, and it becomes difficult for a single GPU kernel to fully utilize them. One way to improve resource utilization is concurrent kernel execution (CKE). Early CKE mainly targets leftover resources; it neither optimizes resource utilization nor provides fairness among concurrent kernels. Spatial multitasking assigns a subset of streaming multiprocessors (SMs) to each kernel. Although it achieves better fairness, it does not address resource underutilization within an SM. Intra-SM sharing has therefore been proposed, which issues thread blocks from different kernels to each SM. However, as shown in this study, intra-SM sharing schemes may undermine overall performance because of severe interference among kernels. Specifically, because concurrent kernels share the memory subsystem, one kernel, even a compute-intensive one, may starve because it cannot issue memory instructions in time. In addition, severe L1 D-cache thrashing and memory pipeline stalls caused by one kernel, especially a memory-intensive one, affect the other kernels and further hurt overall performance. In this study, we investigate several approaches to overcoming these problems in intra-SM sharing. We first show that cache partitioning techniques proposed for CPUs are not effective for GPUs. We then propose two approaches to reduce memory pipeline stalls. The first balances the memory accesses of concurrent kernels. The second limits the number of in-flight memory instructions issued by each kernel. Our evaluation shows that the proposed schemes improve the weighted speedup of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK, by 24.6% and 27.2% on average, respectively, with lightweight hardware overhead.
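The abstract reports its results as weighted speedup but does not define the metric. A minimal sketch of the commonly used definition follows, assuming it is the one intended here: each kernel's throughput when co-running, divided by its throughput when running alone, summed over all kernels. The function name and the IPC values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum of per-kernel normalized throughput under concurrent execution.

    ipc_shared[i]: IPC of kernel i while sharing the GPU with other kernels.
    ipc_alone[i]:  IPC of kernel i when it runs on the GPU by itself.
    """
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one shared and one isolated IPC per kernel")
    return sum(shared / alone for shared, alone in zip(ipc_shared, ipc_alone))


# Hypothetical two-kernel example: kernel 0 keeps 60% of its isolated
# throughput and kernel 1 keeps 75%.
ws = weighted_speedup([60.0, 30.0], [100.0, 40.0])
print(f"weighted speedup = {ws:.2f}")
```

Under this definition, a value above 1.0 for two kernels means that co-running them yields more combined progress than running them one after another, which is why interference that slows one kernel disproportionately lowers the metric.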

Quality of Service Support for Fine-Grained Sharing on GPUs

Gqos: A QoS-Oriented GPU Virtualization with Adaptive Capacity Sharing

QoS-aware Dynamic Resource Allocation with Improved Utilization and Energy Efficiency on GPU

Simultaneous Multikernel: Fine-Grained Sharing of GPUs

Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing

A Virtual Multi-Channel GPU Fair Scheduling Method for Virtual Machines

Effective GPU Sharing Under Compiler Guidance

Toward QoS-Awareness and Improved Utilization of Spatial Multitasking GPUs

Efficient Resource Sharing Through GPU Virtualization on Accelerated High Performance Computing Systems

Towards QoS-Aware and Resource-Efficient GPU Microservices Based on Spatial Multitasking GPUs In Datacenters

Preemption-Aware Kernel Scheduling for GPUs

FLARE: Flexibly Sharing Commodity GPUs to Enforce QoS and Improve Utilization

Priority-Based PCIe Scheduling for Multi-Tenant Multi-GPU Systems

RTGPU: Real-Time GPU Scheduling of Hard Deadline Parallel Tasks With Fine-Grain Utilization

Efficient Sharing and Fine-Grained Scheduling of Virtualized GPU Resources

KubeGPU: efficient sharing and isolation mechanisms for GPU resource management in container cloud

POSTER: FineCo: Fine-grained Heterogeneous Resource Management for Concurrent DNN Inferences

Improving GPU Performance Through Resource Sharing

Enabling Efficient Spatio-Temporal GPU Sharing for Network Function Virtualization

Concurrent analytical query processing with GPUs

POSTER: Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls