Latency-driven Model Placement for Efficient Edge Intelligence Service
Pei-Yu Lin, Zhichen Shi, Zheng Xiao, Cen Chen, Kenli Li
DOI: https://doi.org/10.1109/services55459.2022.00028
2022-01-01
Abstract: Deep learning services are in wide demand and deliver strong results across many applications, such as autonomous driving and voice assistants. Traditionally, deep learning services have been provided mainly through cloud computing, referred to as cloud intelligence services, where the deep learning model is deployed in the cloud and end users must upload data through the wireless and core networks when requesting training and inference services. However, cloud-based deep learning services have deficiencies in latency, privacy, and other respects. For example, users of deep learning services do not want their private data to leak, and the privacy problem is difficult to solve when data is uploaded to the cloud. With the widespread use of the Internet of Things (IoT) and the rapid development of mobile devices such as smartphones and IoT sensors, large amounts of data must be processed by a variety of real-time deep learning services, such as target recognition and voice recognition for smart cities, smart medical care, and the Internet of Vehicles (IoV). Because network bandwidth is limited, uploading large amounts of data to the cloud causes network congestion and greatly increases response time. To meet low-latency requirements, researchers have begun to consider deploying deep learning services at the edge, i.e., edge intelligence services. In edge intelligence services, the computation capability and memory of processors (or devices) vary widely. At the same time, the memory requirements of deep neural network (DNN) models are increasing; for example, the memory usage of AlexNet and ResNet is 2.12G and 16.20G, respectively. Also, branches are becoming common in DNN model design, which introduces parallelism.
Deploying a deep learning model on multiple processors can support large-scale DNN models and parallel execution of a DNN model; conducting the computation of a model in parallel is a possible way to improve the efficiency of edge intelligence services. The key question in edge intelligence services is how to partition the DNN model and assign its parts to processors. In this paper, we propose a novel latency-driven deep learning model placement method for efficient edge intelligence service. Model placement consists of two procedures: model partition and sub-model assignment. In our method, we first convert a DNN model into an execution graph, which is a directed acyclic graph (DAG), and propose a novel latency-driven multilevel graph partition for the model. Then the partitioned sub-models are heuristically assigned to available processors. To the best of our knowledge, this is the first work to propose latency-driven graph partition algorithms for model placement. Extensive experiments on several commonly used DNN models and synthetic datasets show that our method achieves the lowest execution latency with low complexity compared with other state-of-the-art model placement methods.
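
To make the placement problem concrete, the following is a minimal sketch of assigning the operators of a branching execution graph (a DAG) to processors and measuring the resulting end-to-end latency. It is not the latency-driven multilevel partition proposed in the paper: it swaps in a generic earliest-finish-time greedy heuristic purely for illustration. The function names (`topo_order`, `greedy_place`), the cost model (compute cost divided by processor speed, plus a fixed transfer cost whenever producer and consumer sit on different processors), and all numbers are assumptions introduced here.

```python
from collections import defaultdict


def topo_order(nodes, edges):
    """Return a topological order of the DAG using Kahn's algorithm."""
    indeg = {n: 0 for n in nodes}
    succ = defaultdict(list)
    for (u, v) in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = [n for n in indeg if indeg[n] == 0]
    order = []
    while ready:
        u = ready.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if len(order) != len(indeg):
        raise ValueError("execution graph contains a cycle")
    return order


def greedy_place(compute, edges, speeds):
    """Place each operator on the processor where it finishes earliest.

    compute: {op: compute cost}            (hypothetical units)
    edges:   {(producer, consumer): cost}  paid only across processors
    speeds:  relative speed of each processor
    Returns (placement, end-to-end latency).
    """
    preds = defaultdict(list)
    for (u, v), cost in edges.items():
        preds[v].append((u, cost))
    proc_free = [0.0] * len(speeds)  # time at which each processor is idle
    placement, finish = {}, {}
    for op in topo_order(compute, edges):
        best = None
        for p, speed in enumerate(speeds):
            # Inputs arrive later if the producer ran on another processor.
            data_ready = max(
                (finish[u] + (0.0 if placement[u] == p else c)
                 for u, c in preds[op]),
                default=0.0,
            )
            end = max(data_ready, proc_free[p]) + compute[op] / speed
            if best is None or end < best[0]:
                best = (end, p)
        finish[op], placement[op] = best
        proc_free[best[1]] = best[0]
    return placement, max(finish.values())


# A toy model with two parallel branches between an input and an output op.
compute = {"in": 1, "a": 4, "b": 4, "out": 1}
edges = {("in", "a"): 0.5, ("in", "b"): 0.5,
         ("a", "out"): 0.5, ("b", "out"): 0.5}

placement, latency = greedy_place(compute, edges, speeds=[1.0, 1.0])
_, single_latency = greedy_place(compute, edges, speeds=[1.0])
print(placement, latency, single_latency)
```

On this toy graph the two branches land on different processors, and the latency drops from 10.0 on one processor to 6.5 on two, even after paying the transfer cost. A partition-first approach such as the one described above would instead group operators into sub-models before assignment, which reduces the number of cross-processor edges the assignment step has to consider.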