Abstract: Hardware accelerators such as GPUs are required for real-time, low-latency inference with Deep Neural Networks (DNNs). However, because of inherent limits on the parallelism they can exploit, DNNs often under-utilize the capacity of today's high-end accelerators. Spatial multiplexing of the GPU leads to higher GPU utilization and higher inference throughput, but a number of challenges remain: finding, through profiling, the GPU percentage that right-sizes the GPU for each DNN; determining a batching of requests that improves throughput while still meeting application-specific deadlines and service level objectives (SLOs); and maximizing throughput by appropriately scheduling DNNs. This paper introduces a dynamic and fair spatio-temporal scheduler (D-STACK) that enables multiple DNNs to run on the GPU concurrently. To help allocate the appropriate GPU percentage (we call it the "Knee"), we develop and validate a model that estimates the parallelism each DNN can utilize. We also develop a lightweight optimization formulation to find an efficient batch size for each DNN operating with D-STACK. We bring together our optimizations and our spatio-temporal scheduler to provide a holistic inference framework, and we demonstrate its ability to provide high throughput while meeting application SLOs. We compare D-STACK with an ideal scheduler that can allocate the right GPU percentage for every DNN kernel; D-STACK achieves more than 90 percent of the ideal scheduler's throughput and GPU utilization. We also compare D-STACK with other GPU multiplexing and scheduling methods (e.g., NVIDIA Triton, Clipper, Nexus), using popular DNN models. In our controlled experiments multiplexing several popular DNN models, D-STACK achieves up to 1.6X improvement in GPU utilization and up to 4X improvement in inference throughput.
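
To make the two profiling-driven choices in the abstract concrete, the following is a minimal sketch in Python and not the paper's model or optimization formulation. It assumes a hypothetical table of profiled latencies per GPU percentage and per batch size. It treats the "Knee" as the smallest GPU percentage whose latency is within a chosen tolerance of the best profiled latency, and it picks the largest batch whose latency fits the SLO. The function names, the tolerance value, and the sample numbers are all illustrative assumptions.

```python
from typing import Dict, Optional


def find_knee(latency_ms: Dict[int, float], tolerance: float = 0.05) -> int:
    """Return the smallest GPU percentage whose profiled latency is within
    `tolerance` (relative) of the best latency seen in the profile.

    Assumption: this threshold rule is only an illustration of a "knee";
    it is not the analytical model described in the paper.
    """
    if not latency_ms:
        raise ValueError("profile is empty")
    best = min(latency_ms.values())
    threshold = best * (1.0 + tolerance)
    # Scan GPU percentages from smallest to largest.
    for pct in sorted(latency_ms):
        if latency_ms[pct] <= threshold:
            return pct
    raise AssertionError("unreachable: the best entry always meets the threshold")


def largest_batch_within_slo(
    batch_latency_ms: Dict[int, float], slo_ms: float
) -> Optional[int]:
    """Return the largest profiled batch size whose latency fits the SLO,
    or None if no profiled batch size is feasible."""
    feasible = [b for b, lat in batch_latency_ms.items() if lat <= slo_ms]
    return max(feasible) if feasible else None


# Hypothetical profile: GPU percentage -> inference latency in milliseconds.
gpu_profile = {10: 40.0, 20: 22.0, 30: 15.0, 40: 12.2, 50: 12.0, 100: 11.8}
# Hypothetical profile at the knee: batch size -> latency in milliseconds.
batch_profile = {1: 5.0, 2: 8.0, 4: 14.0, 8: 26.0, 16: 50.0}

knee = find_knee(gpu_profile)
batch = largest_batch_within_slo(batch_profile, slo_ms=30.0)
print(f"knee = {knee}% of the GPU, batch size = {batch}")
```

In this made-up profile, latency stops improving meaningfully past 40 percent of the GPU, so the remaining capacity could in principle be given to other DNNs running concurrently.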

Exploiting Intra-SM Parallelism in GPUs Via Persistent and Elastic Blocks

Exploiting the Task-Pipelined Parallelism of Stream Programs on Many-Core GPUs

SPGPU: Spatially Programmed GPU

Exploiting Parallelism in the Simulation of General Purpose Graphics Processing Unit Program

Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing

Efficient GPU Spatial-Temporal Multitasking

Improving GPU Performance Through Resource Sharing

Dynamic Space-Time Scheduling for GPU Inference

Optimizing the LINPACK Algorithm for Large-Scale PCIe-Based CPU-GPU Heterogeneous Systems

Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs

Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning

Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps

D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs

Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster

A Performance Analysis Framework for Exploiting GPU Microarchitectural Capability

Optimizing sparse matrix-vector multiplication based on GPU

Orchestrated Co-scheduling, Resource Partitioning, and Power Capping on CPU-GPU Heterogeneous Systems via Machine Learning

Packing Narrow-Width Operands to Improve GPU Performance

Efficient Resource Sharing Through GPU Virtualization on Accelerated High Performance Computing Systems

GPU Domain Specialization via Composable On-Package Architecture