BIRP: Batch-aware Inference Workload Redistribution and Parallel Scheme for Edge Collaboration.

Hesheng Sun, Xinyi Chen, Zhuzhong Qian, Zengji Li, Ning Chen, Tuo Cao, Suwei Xu, Yitong Zhou
DOI: https://doi.org/10.1145/3605573.3605615
2023-01-01
Abstract: Inference workload redistribution is a technique for moving inference requests from hot edges to idle edges in edge collaborative systems, thereby balancing the inference workload across edges. However, as edge accelerators continue to develop, their resource utilization is often low when inference requests are executed serially, while executing multiple inference requests in parallel raises three challenges: uncertain execution delays, differing response-time Service Level Objectives (SLOs), and the generality of inference workloads in heterogeneous edge collaborative systems. To address these issues, we propose, for the first time in the domain of inference workload redistribution, a Batch-aware Inference workload Redistribution and Parallel execution scheme called BIRP. BIRP reduces the additional latency incurred while individual inference tasks wait during serial execution, thereby improving overall inference accuracy. BIRP uses a Multi-Armed Bandit (MAB) algorithm to adjust the hyperparameters of the Throughput Improvement Ratio (TIR) function online to improve overall inference accuracy. For the nonlinear terms in the problem, BIRP uses a piecewise linear approximation to convert the problem into a Quadratic Programming (QP) problem, which ensures the effectiveness of BIRP in theory. We prototype BIRP on an edge collaborative system composed of three heterogeneous edges. Using a real inference workload trace, we validate the superiority of our algorithm over the state-of-the-art model-selection-based inference workload redistribution algorithm: overall inference loss is reduced by at least 32.9%, and the SLO failure rate is reduced to 19.8% of that of the alternatives.
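As a rough illustration of the online tuning idea mentioned in the abstract, the following is a minimal sketch of a multi-armed bandit choosing among candidate hyperparameter values. The abstract does not say which bandit algorithm, reward signal, or TIR function form BIRP uses, so everything here is an assumption: UCB1 is swapped in as a standard bandit strategy, and the names `UCB1`, `tune`, and `toy_reward` are hypothetical. In a real system the reward would come from observed inference behavior, not from a fixed formula.

```python
import math


class UCB1:
    """Standard UCB1 bandit over a fixed set of candidate values (assumed, not BIRP's own)."""

    def __init__(self, candidates):
        self.candidates = list(candidates)
        self.counts = [0] * len(self.candidates)   # pulls per arm
        self.means = [0.0] * len(self.candidates)  # running mean reward per arm

    def select(self):
        # Try every arm once before relying on confidence bounds.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        total = sum(self.counts)
        scores = [
            m + math.sqrt(2.0 * math.log(total) / n)
            for m, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, i, reward):
        # Incremental update of the running mean for arm i.
        self.counts[i] += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]


def tune(candidates, reward_fn, rounds):
    """Run the bandit for `rounds` steps; return the most-pulled candidate and pull counts."""
    bandit = UCB1(candidates)
    for _ in range(rounds):
        i = bandit.select()
        bandit.update(i, reward_fn(bandit.candidates[i]))
    best = max(range(len(bandit.counts)), key=bandit.counts.__getitem__)
    return bandit.candidates[best], bandit.counts


def toy_reward(value):
    # Hypothetical stand-in for a measured reward, peaking at 0.6.
    return 1.0 - abs(value - 0.6)


best_value, pulls = tune([0.2, 0.4, 0.6, 0.8], toy_reward, rounds=500)
```

With this toy reward, the bandit concentrates its pulls on the candidate nearest the reward peak while still occasionally revisiting the others, which is the exploration and exploitation trade-off that makes bandits suitable for online tuning.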