Abstract: With continued technology scaling, graphics processing units (GPUs) incorporate ever more computing resources, and it becomes difficult for a single GPU kernel to fully utilize them. One way to improve resource utilization is concurrent kernel execution (CKE). Early CKE mainly targets leftover resources; it neither optimizes resource utilization nor provides fairness among concurrent kernels. Spatial multitasking assigns a subset of streaming multiprocessors (SMs) to each kernel. Although it achieves better fairness, it does not address resource underutilization within an SM. Intra-SM sharing has therefore been proposed, issuing thread blocks from different kernels to each SM. However, as this study shows, intra-SM sharing schemes may undermine overall performance because of severe interference among kernels. Specifically, because concurrent kernels share the memory subsystem, one kernel, even a compute-intensive one, may starve because it cannot issue memory instructions in time. In addition, severe L1 D-cache thrashing and memory pipeline stalls caused by one kernel, especially a memory-intensive one, affect the other kernels and further hurt overall performance. In this study, we investigate several approaches to overcoming these problems in intra-SM sharing. We first show that cache partitioning techniques proposed for CPUs are not effective for GPUs. We then propose two approaches to reducing memory pipeline stalls: the first balances the memory accesses of concurrent kernels, and the second limits the number of in-flight memory instructions issued from individual kernels. Our evaluation shows that the proposed schemes improve the weighted speedup of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK, by 24.6% and 27.2% on average, respectively, with lightweight hardware overhead.
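
The second approach, capping the number of in-flight memory instructions per kernel, can be pictured with a small software model. The sketch below is only an illustration under stated assumptions: the class `InflightLimiter`, its methods, and the cap values are hypothetical names and numbers chosen for this example, and the abstract does not describe how the actual hardware mechanism is built or how the limits are chosen.

```python
from dataclasses import dataclass, field


@dataclass
class InflightLimiter:
    """Toy per-kernel cap on in-flight memory instructions (hypothetical model)."""

    limits: dict  # kernel name -> maximum in-flight memory instructions (assumed values)
    inflight: dict = field(default_factory=dict)  # kernel name -> current in-flight count

    def try_issue(self, kernel: str) -> bool:
        """Issue one memory instruction for `kernel` if it is below its cap."""
        current = self.inflight.get(kernel, 0)
        if current >= self.limits[kernel]:
            # The kernel is throttled, so it cannot occupy more of the
            # shared memory pipeline until an earlier instruction completes.
            return False
        self.inflight[kernel] = current + 1
        return True

    def complete(self, kernel: str) -> None:
        """Retire one in-flight memory instruction for `kernel`."""
        if self.inflight.get(kernel, 0) == 0:
            raise ValueError(f"no in-flight memory instruction for {kernel!r}")
        self.inflight[kernel] -= 1


# Usage: a memory-intensive kernel is capped at 2, a compute-intensive one at 4.
limiter = InflightLimiter(limits={"mem_kernel": 2, "compute_kernel": 4})
issued = [limiter.try_issue("mem_kernel") for _ in range(3)]
print(issued)  # the third request is rejected because the cap of 2 is reached
```

In this model a throttled kernel simply retries later, which leaves issue opportunities for the co-running kernel; a real design would also need a policy for setting the caps, which this sketch does not attempt.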
