Abstract: Modern warehouse-scale computers (WSCs) are being outfitted with accelerators to provide the significant compute required by emerging intelligent personal assistant (IPA) workloads such as voice recognition, image classification, and natural language processing. It is well known that the diurnal access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization and efficiency, and prior work has focused on enabling co-location on multicore processors. However, interference when co-locating applications on non-preemptive accelerators is fundamentally different from contention on multicore CPUs and introduces a new set of challenges in reducing QoS violations. To address this open problem, we first identify the underlying causes of QoS violations in accelerator-outfitted servers. Our experiments show that queuing delay for compute resources and PCI-e bandwidth contention for data transfer are the two main factors that contribute to the long tail latencies of user-facing applications. We then present Baymax, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase accelerator utilization. Using DjiNN, a deep neural network service, Sirius, an end-to-end IPA workload, and traditional applications on an Nvidia K40 GPU, our evaluation shows that Baymax improves accelerator utilization by 91.3% while achieving the desired 99%-ile latency target for user-facing applications. In fact, Baymax reduces the 99%-ile latency of user-facing applications by up to 195x over default execution.
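
To make the queuing-delay problem concrete, the following is a minimal sketch of one way a runtime could order tasks on a non-preemptive accelerator. It is not the Baymax implementation: the abstract does not describe its mechanism at this level, and the names `Task`, `predicted_ms`, `qos_headroom`, and `pick_next` are hypothetical, as is the assumption that a duration estimate is available for every task. The idea shown is that a batch task, once issued, cannot be interrupted, so it is issued only if its estimated duration fits in the slack left before the user-facing query's latency target.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    predicted_ms: float  # hypothetical duration estimate, e.g. from offline profiling


def qos_headroom(qos_target_ms: float, elapsed_ms: float,
                 pending_user_tasks: List[Task]) -> float:
    """Slack left for the user-facing query after its own remaining work."""
    remaining = sum(t.predicted_ms for t in pending_user_tasks)
    return qos_target_ms - elapsed_ms - remaining


def pick_next(qos_target_ms: float, elapsed_ms: float,
              user_tasks: List[Task], batch_tasks: List[Task]) -> Optional[Task]:
    """Choose the next task to issue to a non-preemptive accelerator.

    A batch task is issued only if its predicted duration fits in the
    current headroom; otherwise the user-facing task goes first.
    """
    headroom = qos_headroom(qos_target_ms, elapsed_ms, user_tasks)
    for i, task in enumerate(batch_tasks):
        if task.predicted_ms <= headroom:
            return batch_tasks.pop(i)
    if user_tasks:
        return user_tasks.pop(0)
    return None  # nothing can be issued safely right now


# Example: 100 ms target, 40 ms already spent, 30 ms of user work left.
user = [Task("dnn_forward", 30.0)]
batch = [Task("big_kernel", 80.0), Task("small_kernel", 20.0)]
first = pick_next(100.0, 40.0, user, batch)   # headroom is 30 ms
second = pick_next(100.0, 60.0, user, batch)  # headroom is 10 ms
```

In this example the 20 ms batch task is issued first because it fits in the 30 ms of slack, while the 80 ms task is held back and the user-facing task is issued next. The sketch ignores PCI-e transfer contention, which the abstract identifies as the second source of tail latency.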

Laius: Towards Latency Awareness and Improved Utilization of Spatial Multitasking Accelerators in Datacenters

Toward QoS-Awareness and Improved Utilization of Spatial Multitasking GPUs

AI-oriented Workload Allocation for Cloud-Edge Computing.

Baymax: QoS Awareness and Increased Utilization for Non-Preemptive Accelerators in Warehouse Scale Computers.

Arcus: SLO Management for Accelerators in the Cloud with Traffic Shaping

CHARM: Collaborative Host and Accelerator Resource Management for GPU Datacenters

Accelerator-as-a-Service in Public Clouds: An Intra-Host Traffic Management View for Performance Isolation in the Wild

ArkGPU: enabling applications’ high-goodput co-location execution on multitasking GPUs

3M-AI: A Multi-task and Multi-core Virtualization Framework for Multi-FPGA AI Systems in the Cloud

PAC: Preference-Aware Co-location Scheduling on Heterogeneous NUMA Architectures to Improve Resource Utilization.

Locality-Aware Work Stealing Based on Online Profiling and Auto-Tuning for Multisocket Multicore Architectures

Laws: Locality-Aware Work-Stealing For Multi-Socket Multi-Core Architectures

Hardware-Assisted Virtualization of Neural Processing Units for Cloud Platforms

Enhanced GPU Resource Utilization through Fairness-aware Task Scheduling

Alita: Comprehensive Performance Isolation through Bias Resource Management for Public Clouds

Enable simultaneous DNN services based on deterministic operator overlap and precise latency prediction

A Comprehensive Evaluation of FPGA-Based Spatial Acceleration of LLMs

FLARE: Flexibly Sharing Commodity GPUs to Enforce QoS and Improve Utilization

MARS: Exploiting Multi-Level Parallelism for DNN Workloads on Adaptive Multi-Accelerator Systems