CD-MSA: Cooperative and Deadline-Aware Scheduling for Efficient Multi-Tenancy on DNN Accelerators

Chunyang Wang,Yuebin Bai,Desen Sun
DOI: https://doi.org/10.1109/tpds.2023.3276759
IF: 5.3
2023-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract:With DNN turning into the backbone of AI cloud services and propelling the emergence of INFerence-as-a-Service (INFaaS), DNN-specific accelerators have become the indispensable components of cloud inference systems. Due to the conservative “one-task-at-a-time” working mode and deadline blindness of those accelerators, implementing multi-tenancy that aims to improve the cost-effectiveness and meet SLA requirements is intractable. Recent studies including the temporal and spatial approaches, employ manifold scheduling mechanisms and sophisticated architecture innovations to address the challenge. However, these researches either still neglect the deadline awareness or render inevitable and expensive hardware overheads such as switches and storage. In this paper, we present Cooperative and Deadline-aware Multi-Systolic-Array scheduling (CD-MSA), a low-cost solution for the cloud inference that utilizes the real time mechanism and task-level parallelism to enable efficient multi-tenancy. Based on our preemptive multi-systolic-array accelerator architecture supporting the simultaneous task co-location, we first construct a fine-grained DNN execution model to lay the groundwork for the lightweight preemption. Second, we design a cooperative, deadline- and laxity-aware scheduler in conjunction with an efficient schedulability test method for better QoS guarantee without introducing additional hardware cost. Finally, to further promote the overall throughput, we propose dynamic task fusion, a software approach that fuses different tasks into the logically “multi-threading” tasks at runtime. We compare CD-MSA with several state-of-the-art researches across three multi-DNN workloads. The evaluation results show CD-MSA improves the latency-bounded throughput, SLA satisfaction rate and weighted system throughput by up to 62%, 63% and 27%, respectively.
What problem does this paper attempt to address?