COS: Cross-Processor Operator Scheduling for Multi-Tenant Deep Learning Inference

Changyao Lin, Jie Liu
DOI: https://doi.org/10.1109/iwqos61813.2024.10682900
2024-01-01
Abstract: Multi-tenant inference, a prevalent inference paradigm, requires deploying multiple deep learning models on one hardware platform to process inference tasks concurrently. Modern platforms are typically equipped with heterogeneous processors, as in CPU-GPU platforms. To reduce resource contention and improve Quality of Service (QoS) in the multi-tenant scenario, existing work has studied cross-processor inference at the model and layer levels. However, such coarse-grained scheduling cannot flexibly account for subtle resource fluctuations, which may lead to task blockages and incur significant processor switching overheads. These approaches also usually require extensive modification and retraining of the models. We therefore propose COS, a finer-grained, operator-level cross-processor scheduling framework that divides computational workloads and switching overheads among tenants more precisely, without modifying or retraining the models. We introduce a novel intermediate representation to abstract and simplify the scheduling problem, and propose an efficient two-phase search algorithm. COS is automated and easy to scale. Through experiments on various heterogeneous hardware platforms and models, we demonstrate that COS is more flexible and effective than layer-level scheduling and achieves higher throughput than single-processor processing in the multi-tenant scenario. Furthermore, COS is an offline optimization method, and its overhead is acceptable.
What problem does this paper attempt to address?