Abstract: Hardware accelerators such as GPUs are required for real-time, low-latency inference with Deep Neural Networks (DNNs). However, due to the inherent limits to the parallelism they can exploit, DNNs often under-utilize the capacity of today's high-end accelerators. Although spatial multiplexing of the GPU leads to higher GPU utilization and higher inference throughput, several significant challenges remain: finding, through profiling, the GPU percentage that right-sizes the GPU for each DNN; determining a batching of requests that improves throughput while still meeting application-specific deadlines and service level objectives (SLOs); and maximizing throughput by appropriately scheduling DNNs. This paper introduces a dynamic and fair spatio-temporal scheduler (D-STACK) that enables multiple DNNs to run in the GPU concurrently. To help allocate the appropriate GPU percentage (we call it the "Knee"), we develop and validate a model that estimates the parallelism each DNN can utilize. We also develop a lightweight optimization formulation to find an efficient batch size for each DNN operating with D-STACK. We bring together our optimizations and our spatio-temporal scheduler to provide a holistic inference framework, and we demonstrate its ability to provide high throughput while meeting application SLOs. We compare D-STACK with an ideal scheduler that can allocate the right GPU percentage for every DNN kernel; D-STACK achieves more than 90 percent of the ideal scheduler's throughput and GPU utilization. We also compare D-STACK with other GPU multiplexing and scheduling methods (e.g., NVIDIA Triton, Clipper, Nexus) using popular DNN models. In our controlled experiments multiplexing several popular DNN models, D-STACK achieves up to 1.6X improvement in GPU utilization and up to 4X improvement in inference throughput.

ElasticRoom: Multi-Tenant DNN Inference Engine Via Co-design with Resource-constrained Compilation and Strong Priority Scheduling

Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU

Dynamic Space-Time Scheduling for GPU Inference

ElasticBatch: A Learning-Augmented Elastic Scheduling System for Batch Inference on MIG

Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference

iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud

PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers

Elastic Deep Learning in Multi-Tenant GPU Clusters

Efficient CUDA stream management for multi-DNN real-time inference on embedded GPUs

Multi-user Co-inference with Batch Processing Capable Edge Server

GACER: Granularity-Aware ConcurrEncy Regulation for Multi-Tenant Deep Learning

VELTAIR: towards high-performance multi-tenant deep learning services via adaptive compilation and scheduling

SGPRS: Seamless GPU Partitioning Real-Time Scheduler for Periodic Deep Learning Workloads

Adaptive Partitioning and Efficient Scheduling for Distributed DNN Training in Heterogeneous IoT Environment

ArkGPU: enabling applications’ high-goodput co-location execution on multitasking GPUs

D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs

CoFB: latency-constrained co-scheduling of flows and batches for deep learning inference service on the CPU–GPU system

ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments

Elastic-DF: Scaling Performance of DNN Inference in FPGA Clouds through Automatic Partitioning

Throughput Maximization of DNN Inference: Batching or Multi-Tenancy?

Energy-Efficient GPU Clusters Scheduling for Deep Learning