Latency-Based Inter-Operator Scheduling for CNN Inference Acceleration on GPU

Yukai Ping, He Jiang, Xingxiang Liu, Zhenyang Zhao, Zhide Zhou, Xin Chen
DOI: https://doi.org/10.1109/tsc.2023.3345952
IF: 11.019
2023-01-01
IEEE Transactions on Services Computing
Abstract: Convolutional Neural Networks (CNNs) are widely deployed on Graphics Processing Units (GPUs) to support Deep Learning (DL) based services. Popular DL frameworks usually ignore inter-operator parallelism when executing CNN inference, which results in high inference latency. Although some inter-operator scheduling methods have been proposed, a critical trade-off remains between inference latency (effectiveness) and scheduling time (efficiency). In this paper, we propose LIOS, a novel latency-based heuristic inter-operator scheduling method that balances inference latency and scheduling time. In LIOS, a CNN latency model is first built for the given CNN and GPU. Every operator is then assigned a priority value that represents its importance. During each iteration of the scheduling process, LIOS identifies the current data-independent operators, selects the operator with the highest priority value, and assigns it to the GPU stream with the smallest finish time. Extensive experimental results demonstrate the effectiveness and efficiency of LIOS. In terms of effectiveness, LIOS speeds up the inference of normal-size and large-size CNNs by $1.13\sim 1.59\times$ compared to sequential scheduling, which is comparable to IOS, the latest state-of-the-art scheduling method. In terms of efficiency, LIOS speeds up the scheduling process by $7\sim 9210\times$ compared to IOS.
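
The scheduling loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how priorities are computed, so the sketch assumes a common list-scheduling choice: an operator's priority is the total latency of the longest path from it to the end of the graph. The function name `schedule`, the `latency` and `deps` dictionaries, and the toy latencies are all hypothetical, and the fixed latency table stands in for the paper's latency model.

```python
from typing import Dict, List, Tuple

Plan = Dict[str, Tuple[int, float, float]]  # operator -> (stream, start, finish)


def schedule(latency: Dict[str, float],
             deps: Dict[str, List[str]],
             num_streams: int = 2) -> Tuple[Plan, float]:
    """Greedy list scheduling of a DAG of operators onto parallel streams.

    latency: operator -> estimated latency (stand-in for a latency model)
    deps:    operator -> operators whose outputs it consumes
    """
    # Build successor lists so priorities can be computed from the exits back.
    succs: Dict[str, List[str]] = {op: [] for op in latency}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)

    # ASSUMPTION: priority = latency of the longest path from op to an exit.
    priority: Dict[str, float] = {}

    def rank(op: str) -> float:
        if op not in priority:
            tail = max((rank(s) for s in succs[op]), default=0.0)
            priority[op] = latency[op] + tail
        return priority[op]

    for op in latency:
        rank(op)

    stream_free = [0.0] * num_streams  # finish time of each stream so far
    plan: Plan = {}
    remaining = set(latency)
    while remaining:
        # Data-independent operators: every dependency is already scheduled.
        ready = [op for op in remaining
                 if all(p in plan for p in deps.get(op, []))]
        # Highest priority first; ties are broken by name for determinism.
        op = max(ready, key=lambda o: (priority[o], o))
        # Stream with the smallest finish time.
        stream = min(range(num_streams), key=lambda s: stream_free[s])
        # ASSUMPTION: an operator cannot start before its inputs are produced.
        data_ready = max((plan[p][2] for p in deps.get(op, [])), default=0.0)
        start = max(stream_free[stream], data_ready)
        finish = start + latency[op]
        plan[op] = (stream, start, finish)
        stream_free[stream] = finish
        remaining.remove(op)
    return plan, max(stream_free)


# Toy diamond-shaped graph with made-up latencies: a feeds b and c, which feed d.
lat = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 1.0}
deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
plan, makespan = schedule(lat, deps, num_streams=2)
```

On this toy graph the two-stream plan finishes at time 5.0, against 7.0 when the same operators run on a single stream, because the independent branches `b` and `c` overlap. Real GPU streams also contend for shared hardware resources, which this sketch ignores.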