OfpCNN: On-Demand Fine-Grained Partitioning for CNN Inference Acceleration in Heterogeneous Devices

Lei Yang, Can Zheng, Xiaoyuan Shen, Guoqi Xie
DOI: https://doi.org/10.1109/tpds.2023.3321755
IF: 5.3
2023-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: Collaborative inference is a promising method for balancing the limited computational power of Internet of Things (IoT) devices against the huge computational demands of convolutional neural networks (CNNs). In this approach, a CNN is divided into multiple partitions that are placed on multiple devices and run simultaneously. However, this approach raises two major challenges. (1) Computation latencies vary when the central processing unit (CPU) loads of devices differ, yet no suitable methods are available for accurately determining computation latency on the basis of CPU utilization. (2) Existing methods partition a CNN model either vertically or horizontally. The granularity of these methods is extremely coarse and their accuracy is low. To address these issues, this study proposes OfpCNN, a distributed collaborative inference framework that supports a fine-grained partitioning scheme for CNNs on heterogeneous devices. First, the framework uses a layer latency prediction model based on floating-point operations and CPU load (FCPM) to accurately predict the computation latency of each CNN layer on different devices. Subsequently, OfpCNN uses horizontal and vertical partitioning methods (HVPM) to partition the input feature maps and the structure of the CNN, respectively, in accordance with network conditions and computing capacity, and then assigns the partitions to multiple devices for execution. HVPM jointly considers the execution position of each layer, the degree of parallelism, and the location of the devices responsible for data aggregation and distribution, and can consequently obtain more fine-grained partitioning schemes. Experimental results show that FCPM achieves a minimum accuracy of 88% and that HVPM improves inference speed by 1–2.54 times compared with other state-of-the-art methods.
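The abstract does not give the form of FCPM or HVPM, so the following is only a minimal illustrative sketch of the two ideas it names: estimating a layer's latency from its floating-point operations and the device's CPU load, and splitting an input feature map horizontally across devices in proportion to their estimated speed. All function names, the linear load model, and the proportional row split are assumptions made for illustration, not the paper's method.

```python
def conv_flops(h_out, w_out, c_in, c_out, k):
    """FLOPs of a k x k convolution, counting one multiply-accumulate as 2 FLOPs."""
    return 2 * h_out * w_out * c_in * c_out * k * k


def predict_latency(flops, peak_flops_per_s, cpu_load):
    """Estimated latency in seconds.

    Assumption (not from the paper): the throughput available to inference
    shrinks linearly with the CPU load already present on the device.
    """
    if not 0.0 <= cpu_load < 1.0:
        raise ValueError("cpu_load must be in [0, 1)")
    return flops / (peak_flops_per_s * (1.0 - cpu_load))


def split_rows(total_rows, speeds):
    """Split feature-map rows across devices in proportion to their speeds.

    Uses the largest-remainder method so the row counts sum to total_rows.
    """
    total_speed = sum(speeds)
    exact = [total_rows * s / total_speed for s in speeds]
    rows = [int(x) for x in exact]
    leftover = total_rows - sum(rows)
    # Hand the remaining rows to the devices with the largest fractional share.
    order = sorted(range(len(speeds)), key=lambda i: exact[i] - rows[i], reverse=True)
    for i in order[:leftover]:
        rows[i] += 1
    return rows


if __name__ == "__main__":
    flops = conv_flops(h_out=56, w_out=56, c_in=64, c_out=64, k=3)
    # Two hypothetical devices: (peak FLOP/s, current CPU load).
    devices = [(2e9, 0.5), (4e9, 0.25)]
    effective = [peak * (1.0 - load) for peak, load in devices]
    print("rows per device:", split_rows(56, effective))
    for peak, load in devices:
        print("full-layer latency (s):", predict_latency(flops, peak, load))
```

A real predictor would presumably be fitted to measured per-layer latencies on each device, and a real partitioner would also account for transmission cost and the overlapping rows that neighbouring partitions of a convolution must share; both are omitted here.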