Abstract: We investigate the performance of the concurrency mechanisms available on NVIDIA's new Ampere GPU microarchitecture under deep learning training and inference workloads. In contrast to previous studies that treat the GPU as a black box, we examine scheduling at the microarchitectural level. We find that the lack of fine-grained preemption mechanisms, robust task prioritization options, and contention-aware thread block placement policies limits the effectiveness of NVIDIA's concurrency mechanisms. In summary, the sequential nature of deep learning workloads and their fluctuating resource requirements and kernel runtimes make executing such workloads while maintaining consistently high utilization and low, predictable turnaround times difficult on current NVIDIA hardware.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper primarily explores the performance issues of concurrency mechanisms in the new Ampere microarchitecture of NVIDIA GPUs under deep learning workloads. Specifically, the paper focuses on the following aspects:

1. **Effectiveness of Concurrency Mechanisms**:
   - How do the three current concurrency mechanisms provided by NVIDIA GPUs (priority streams, time slicing, and multi-process service) perform when executing deep learning training and inference tasks?
   - Can these mechanisms ensure that latency-sensitive inference requests have predictable and low turnaround times while fully utilizing idle resources for best-effort training tasks?
2. **Limitations of Existing Mechanisms**:
   - The paper identifies several significant limitations in the current concurrency mechanisms, such as the lack of fine-grained preemption, robust task priority options, and thread block placement strategies that consider contention.
   - Specifically, when using priority streams, the kernels of high-priority inference tasks often experience accumulated delays due to waiting for training task kernels.
   - The time slicing mechanism does not allow different applications to execute simultaneously on the GPU, making it difficult to improve utilization from serial execution.
   - While the multi-process service (MPS) can proportionally allocate resources to each application, it cannot assign scheduling priorities to tasks.
3. **Proposed Improvements**:
   - The paper suggests that implementing a fine-grained preemption mechanism could improve the turnaround time and utilization of concurrent deep learning workloads.
   - A fine-grained preemption mechanism would allow the GPU to preempt any specific subset of thread blocks during execution and resume them later, thereby reducing resource contention and improving the predictability of serving inference requests.

Through these studies, the paper aims to provide guidance for the design of GPU concurrency mechanisms, particularly when handling deep learning workloads, to enhance performance and resource utilization.
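
The accumulated-delay effect described for priority streams can be illustrated with a toy model. The sketch below is not the paper's methodology or measurement: it assumes the GPU behaves as a single execution slot, that training kernels run back to back whenever the slot is free, and that each inference kernel becomes ready a short launch gap after the previous one finishes. The function name `inference_turnaround` and all parameters and durations are hypothetical values chosen only for illustration.

```python
import math


def inference_turnaround(n_kernels, infer_ms, train_ms, gap_ms, preemptive):
    """Turnaround time (ms) of one inference request in a toy single-slot GPU.

    Assumptions (illustrative only, not taken from the paper):
    - the request arrives at t = 0 and consists of n_kernels sequential kernels;
    - each inference kernel becomes ready gap_ms after the slot was last freed;
    - training kernels of length train_ms fill the slot whenever it is free;
    - without preemption, a ready inference kernel waits for the running
      training kernel to finish; with preemption, it starts immediately.
    """
    free_at = 0.0  # time at which the slot was last handed back to training
    for _ in range(n_kernels):
        ready = free_at + gap_ms
        if preemptive:
            start = ready
        else:
            # Training kernels start at free_at, free_at + train_ms, ...
            # The inference kernel starts at the first boundary at or after ready.
            boundaries = math.ceil((ready - free_at) / train_ms)
            start = free_at + boundaries * train_ms
        free_at = start + infer_ms
    return free_at


# Hypothetical numbers: 10 inference kernels of 0.5 ms, 2 ms training kernels,
# and a 0.01 ms launch gap between consecutive inference kernels.
without_preemption = inference_turnaround(10, 0.5, 2.0, 0.01, preemptive=False)
with_preemption = inference_turnaround(10, 0.5, 2.0, 0.01, preemptive=True)
print(without_preemption, with_preemption)
```

With these made-up numbers, every inference kernel in the non-preemptive case waits behind a full training kernel, so the request takes 25 ms instead of roughly 5.1 ms. Real GPUs schedule at thread block granularity across many streaming multiprocessors, so the model only conveys why the delays add up across a sequential chain of kernels.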

Characterizing Concurrency Mechanisms for NVIDIA GPUs under Deep Learning Workloads
