Abstract: With the development of deep learning, hardware accelerators such as GPUs are widely used to accelerate deep learning applications. A key problem in GPU clusters is how to schedule diverse deep learning applications, including training applications and latency-critical inference applications, to achieve optimal system performance. In cloud datacenters, inference applications often require few resources, so running a single inference application exclusively on a GPU can waste a significant share of that GPU's resources. Existing work mainly focuses on co-locating multiple inference applications in datacenters using MPS (Multi-Process Service). This execution pattern has several problems: datacenters may remain in a low-workload state for long periods because of the diurnal pattern of inference applications, MPS-based data sharing can lead to interaction errors between contexts, and resource contention may cause Quality of Service (QoS) violations. To solve these problems, we propose ArkGPU, a runtime system that dynamically allocates resources. ArkGPU improves the resource utilization of the cluster while guaranteeing the QoS of inference applications. It comprises a performance predictor, a scheduler, a resource limiter, and an adjustment unit. We conduct extensive experiments on the NVIDIA V100 GPU to verify the effectiveness of ArkGPU. ArkGPU achieves high goodput for latency-critical applications, with an average throughput increase of 584.27% compared to MPS. When multiple applications are deployed simultaneously on ArkGPU, goodput improves by 94.98% compared to k8s-native and by 38.65% compared to MPS.

ArkGPU: enabling applications’ high-goodput co-location execution on multitasking GPUs