Characterizing Concurrency Mechanisms for NVIDIA GPUs under Deep Learning Workloads

Guin Gilman, Robert J. Walls
DOI: https://doi.org/10.48550/arXiv.2110.00459
2021-10-01
Abstract: We investigate the performance of the concurrency mechanisms available on NVIDIA's new Ampere GPU microarchitecture under deep learning training and inference workloads. In contrast to previous studies that treat the GPU as a black box, we examine scheduling at the microarchitectural level. We find that the lack of fine-grained preemption mechanisms, robust task prioritization options, and contention-aware thread block placement policies limits the effectiveness of NVIDIA's concurrency mechanisms. In summary, the sequential nature of deep learning workloads and their fluctuating resource requirements and kernel runtimes make executing such workloads while maintaining consistently high utilization and low, predictable turnaround times difficult on current NVIDIA hardware.
Distributed, Parallel, and Cluster Computing; Hardware Architecture; Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper examines how well the concurrency mechanisms of NVIDIA's Ampere GPU microarchitecture perform under deep learning workloads. It focuses on three aspects:

1. **Effectiveness of Concurrency Mechanisms**:
   - How do the three concurrency mechanisms currently provided by NVIDIA GPUs (priority streams, time slicing, and Multi-Process Service) perform when executing deep learning training and inference tasks?
   - Can these mechanisms give latency-sensitive inference requests low, predictable turnaround times while using otherwise idle resources for best-effort training tasks?
2. **Limitations of Existing Mechanisms**:
   - The paper identifies several significant limitations in the current concurrency mechanisms: the lack of fine-grained preemption, of robust task prioritization options, and of contention-aware thread block placement policies.
   - With priority streams, the kernels of high-priority inference tasks often accumulate delays because they must wait for training task kernels.
   - Time slicing does not allow different applications to execute on the GPU simultaneously, so it offers little utilization improvement over serial execution.
   - The Multi-Process Service (MPS) can allocate resources to each application proportionally, but it cannot assign scheduling priorities to tasks.
3. **Proposed Improvements**:
   - The paper suggests that a fine-grained preemption mechanism could improve the turnaround time and utilization of concurrent deep learning workloads.
   - Such a mechanism would allow the GPU to preempt any specific subset of thread blocks during execution and resume them later, thereby reducing resource contention and making inference request serving more predictable.

Through these studies, the paper aims to provide guidance for the design of GPU concurrency mechanisms, particularly when handling deep learning workloads, to enhance performance and resource utilization.
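
The accumulated-delay effect described for priority streams can be illustrated with a small toy simulation. This is only a sketch under simplifying assumptions and is not the paper's methodology or measurement setup: it models scheduling at whole-kernel rather than thread-block granularity, assumes one kernel occupies the entire device at a time, and assumes each inference kernel becomes ready a fixed `launch_gap` after the previous one finishes. The function name `inference_turnaround` and all of its parameters are hypothetical.

```python
from itertools import cycle


def inference_turnaround(infer_kernels, train_kernels, launch_gap, preemptive):
    """Turnaround time of one inference request arriving at t = 0.

    Toy model (assumptions, not taken from the paper):
      * one kernel occupies the whole device at a time;
      * inference kernels run strictly in order, and each becomes ready
        `launch_gap` time units after the previous one finishes;
      * the training job always has a kernel ready and fills any idle gap.
    """
    train = cycle(train_kernels)  # endless stream of training kernel runtimes
    free_at = 0.0                 # time the device last became free
    ready = 0.0                   # time the next inference kernel is ready
    finish = 0.0

    for runtime in infer_kernels:
        # While no inference kernel is ready, training launches kernels
        # back to back; `t` ends up as the finish time of the last one.
        t = free_at
        while t < ready:
            t += next(train)

        if preemptive:
            # Idealized preemption: the in-flight training kernel is set
            # aside at once (its remaining work is not tracked here).
            start = ready
        else:
            # No preemption: the inference kernel waits for the in-flight
            # training kernel to drain, even though it has higher priority.
            start = t

        finish = start + runtime
        free_at = finish
        ready = finish + launch_gap

    return finish


if __name__ == "__main__":
    infer = [1, 1, 1, 1]   # four short inference kernels
    training = [5]         # long training kernels
    print(inference_turnaround(infer, training, 0.5, preemptive=False))  # 19.0
    print(inference_turnaround(infer, training, 0.5, preemptive=True))   # 5.5
```

In this toy setting, every gap between inference kernels lets a long training kernel slip in, so the request takes 19.0 time units without preemption versus 5.5 with idealized preemption, which equals its runtime in isolation. The numbers are illustrative only and say nothing about actual Ampere hardware.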