Enabling Latency-Sensitive DNN Inference Via Joint Optimization of Model Surgery and Resource Allocation in Heterogeneous Edge
Zhaowu Huang, Fang Dong, Dian Shen, Huitian Wang, Xiaolin Guo, Shucun Fu
DOI: https://doi.org/10.1145/3545008.3545071
2022-01-01
Abstract: Edge computing is now widely adopted to support emerging deep neural network (DNN)-driven intelligence scenarios that require low latency and high accuracy and that include heterogeneous end devices and DNNs. In such scenarios, the influx of data and computation into a shared edge server incurs prohibitive latency. We therefore exploit multi-exit DNNs (ME-DNNs), in which tasks can exit early at appropriate depths to save inference time. However, naively using ME-DNNs in the heterogeneous edge still fails to deliver fast inference because of improper model surgery and resource allocation. In this paper, we propose an Acceleration scheme for Inference based on ME-DNNs with Adaptive model surgery and resource allocation (AIMA) to accelerate DNN inference. We formulate a mixed-integer programming problem that jointly optimizes model surgery and resource allocation to minimize the task completion time. We first determine the optimal resource allocation policy for a given model surgery decision profile, and then model the model surgery decision-making as a weighted congestion game. We prove the existence of a Nash equilibrium and propose a decentralized algorithm. Extensive experimental results show that AIMA significantly outperforms state-of-the-art methods, achieving up to a 6.01× speedup.
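To make the congestion-game idea concrete, the following is a minimal toy sketch of best-response dynamics, not the AIMA algorithm itself. Everything in it is an assumption for illustration: the function name `best_response_offloading`, the binary local-versus-edge decision (the paper's decisions concern model surgery, which the abstract does not detail), and the cost model, where a task run on the shared server costs the total offloaded weight divided by a hypothetical server capacity.

```python
def best_response_offloading(weights, local_cost, capacity, max_rounds=100):
    """Toy best-response dynamics for a weighted congestion game.

    Assumed model (not from the paper): device i either runs its task
    locally at fixed cost local_cost[i], or offloads it to one shared
    edge server, where every offloaded task costs
    (sum of offloaded weights) / capacity.
    """
    n = len(weights)
    offload = [False] * n  # start with every device running locally

    for _ in range(max_rounds):
        changed = False
        for i in range(n):
            # Load placed on the server by all devices other than i.
            others = sum(w for j, w in enumerate(weights) if offload[j] and j != i)
            edge_cost = (others + weights[i]) / capacity

            current = edge_cost if offload[i] else local_cost[i]
            alternative = local_cost[i] if offload[i] else edge_cost

            # Switch only on a strict improvement for device i.
            if alternative < current:
                offload[i] = not offload[i]
                changed = True
        if not changed:
            break  # no device wants to deviate: a Nash equilibrium of the toy game
    return offload


decisions = best_response_offloading(
    weights=[1, 2, 3], local_cost=[2.0, 3.0, 10.0], capacity=2.0
)
print(decisions)
```

Each device updates using only its own costs and the current server load, which is what makes this kind of scheme decentralized; the `max_rounds` cap is a safeguard added here, not something stated in the abstract.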