Abstract: Deep neural networks (DNNs) have been widely used in many intelligent applications such as object recognition and autonomous driving due to their superior performance in inference tasks. However, DNN models are usually computationally heavyweight, which hinders their use on resource-constrained Internet of Things (IoT) end devices. To this end, cooperative deep inference has been proposed, in which a DNN model is adaptively partitioned into two parts and the parts are executed on different devices (cloud or edge end devices) to minimize the total inference latency. One important issue is thus to find the optimal partition of the deep model in real time as network conditions change. In this paper, we formulate the optimal DNN partition as a min-cut problem in a directed acyclic graph (DAG) specially derived from the DNN and propose a novel two-stage approach named quick deep model partition (QDMP) to solve it. QDMP exploits the fact that the optimal partition of a DNN model must lie between two adjacent cut vertices in the corresponding DAG. It first identifies the two cut vertices and then considers only the subgraph between them when calculating the min-cut. QDMP can find the optimal model partition with a response time of less than 300 ms even for large DNN models containing hundreds of layers (up to 66.3x faster than the state-of-the-art solution), and thus enables real-time cooperative deep inference over the cloud and edge end devices. Moreover, we observe one important fact that is ignored in all previous works: because many deep learning frameworks optimize the execution of DNN models, the execution latency of a series of layers in a DNN does not equal the sum of each layer's independent execution latency. This results in inaccurate inference latency estimation in existing works. We propose a new execution latency measurement method, with which the inference latency can be accurately estimated in practice.
We implement QDMP on real hardware and use a real-world self-driving car video dataset to evaluate its performance. Experimental results show that QDMP outperforms the state-of-the-art solution, reducing inference latency by up to 1.69x and increasing throughput by up to 3.81x.
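
The min-cut formulation mentioned in the abstract can be illustrated with a small sketch. This is not the authors' QDMP implementation: the graph construction below is one common way to cast an edge/cloud split as an s-t min-cut, the layer names and latency numbers are invented for illustration, and the cut-vertex pruning stage of QDMP is omitted. It also assumes that per-layer latencies simply add up (which the abstract notes is inaccurate in practice) and that returning the final result to the device costs nothing.

```python
from collections import defaultdict, deque

INF = 10**9  # stands in for "infinite" capacity; all latencies are integer ms


def min_cut(cap, s, t):
    """Edmonds-Karp max-flow. Returns (cut value, nodes on the source side)."""
    res = defaultdict(lambda: defaultdict(int))  # residual capacities
    for u in cap:
        for v, c in cap[u].items():
            res[u][v] += c
            res[v][u] += 0  # make sure the reverse edge exists
    flow = 0
    while True:
        # Breadth-first search for an augmenting path from s to t.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No path left: nodes still reachable from s form the source side.
            return flow, set(parent)
        # Find the bottleneck capacity along the path, then push flow.
        bottleneck = INF
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        flow += bottleneck


def partition(layers, edges, edge_ms, cloud_ms, tx_ms):
    """Hypothetical helper: source side = edge device, sink side = cloud."""
    cap = defaultdict(dict)
    for v in layers:
        cap["SRC"][v] = cloud_ms[v]      # cut (paid) if v runs on the cloud
        cap[v]["SINK"] = edge_ms[v]      # cut (paid) if v runs on the edge
        cap[v][("out", v)] = tx_ms[v]    # paid once if v's output is uploaded
    for u, v in edges:
        cap[("out", u)][v] = INF         # auxiliary node avoids double-counting
        cap[v][u] = INF                  # forbid cloud -> edge back-transfers
    total, source_side = min_cut(cap, "SRC", "SINK")
    return total, [v for v in layers if v in source_side]


# Toy DAG with one branch; every number here is made up.
layers = ["input", "conv1", "a", "b", "add", "fc"]
edges = [("input", "conv1"), ("conv1", "a"), ("conv1", "b"),
         ("a", "add"), ("b", "add"), ("add", "fc")]
edge_ms = {"input": 0, "conv1": 10, "a": 30, "b": 30, "add": 5, "fc": 20}
cloud_ms = {"input": INF, "conv1": 1, "a": 3, "b": 3, "add": 1, "fc": 2}
tx_ms = {"input": 50, "conv1": 20, "a": 25, "b": 25, "add": 8, "fc": 0}

total, on_edge = partition(layers, edges, edge_ms, cloud_ms, tx_ms)
print(total, on_edge)  # -> 39 ['input', 'conv1']
```

In this toy case the cheapest split runs `conv1` on the device (10 ms), uploads its output once (20 ms), and runs the remaining layers in the cloud (9 ms). Setting the cloud latency of `input` to `INF` pins the raw input to the device, so uploading it is charged through `tx_ms["input"]`.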

Real-time Adaptive Partition and Resource Allocation for Multi-user End-cloud Inference Collaboration in Mobile Environment

Extendable Multi-Device Collaborative Pipeline Parallel Inference in the Edge-Cloud Scenario

Adaptive Deep Inference Framework for Cloud-Edge Collaboration

DNN Real-Time Collaborative Inference Acceleration with Mobile Edge Computing

Collaborative DNNs Inference with Joint Model Partition and Compression in Mobile Edge-Cloud Computing Networks

HSMS-ADP: Adaptive DNNs Partitioning for End-Edge Collaborative Inference in High-Speed Mobile Scenarios

An Adaptive DNN Inference Acceleration Framework with End–edge–cloud Collaborative Computing

Towards Real-time Cooperative Deep Inference over the Cloud and Edge End Devices

Multi-Compression Scale DNN Inference Acceleration based on Cloud-Edge-End Collaboration

Joint DNN Partition Deployment and Resource Allocation for Delay-Sensitive Deep Learning Inference in IoT

Joint DNN partitioning and resource allocation for completion rate maximization of delay-aware DNN inference tasks in wireless powered mobile edge computing

Accelerating DNN Inference by Edge-Cloud Collaboration

A Fine-Grained End-to-End Latency Optimization Framework for Wireless Collaborative Inference

Joint Optimization With DNN Partitioning and Resource Allocation in Mobile Edge Computing

Joint multi-user DNN partitioning and task offloading in mobile edge computing

On-Demand Deep Model Compression for Mobile Devices

Distributed DNN Inference with Fine-grained Model Partitioning in Mobile Edge Computing Networks

Joint Multi-User DNN Partitioning and Computational Resource Allocation for Collaborative Edge Intelligence

Collaborative Inference Acceleration Integrating DNN Partitioning and Task Offloading in Mobile Edge Computing

Collaborative Deep Neural Network Inference via Mobile Edge Computing

Joint DNN Partition and Resource Allocation Optimization for Energy-Constrained Hierarchical Edge-Cloud Systems