Minimizing Latency for Multi-DNN Inference on Resource-Limited CPU-Only Edge Devices

Tao Wang, Tuo Shi, Xiulong Liu, Jianping Wang, Bin Liu, Yingshu Li, Yechao She
DOI: https://doi.org/10.1109/infocom52122.2024.10621120
2024-01-01
Abstract: Despite considerable advancements in specialized hardware, the majority of IoT edge devices still rely on CPUs. The burgeoning number of IoT users amplifies the challenges of performing multiple Deep Neural Network (DNN) inferences on these resource-limited, CPU-only edge devices. Existing strategies, including model compression, hardware acceleration, and model partitioning, often trade away inference accuracy, are unsuitable due to hardware specificity, or lead to inefficient resource utilization. In response to these challenges, this paper introduces L-PIC (Latency Minimized Parallel Inference on CPU), a framework expressly devised to optimize resource allocation, decrease inference latency, and maintain result accuracy on CPU-only edge devices. A series of comprehensive experiments has verified the efficiency and effectiveness of the L-PIC framework: compared to the state-of-the-art method, L-PIC reduces multi-DNN inference latency by approximately 30% on average across all tested scenarios.