Abstract: Deep neural networks (DNNs) are widely used in intelligent applications such as object recognition and autonomous driving because of their superior inference performance. However, DNN models are usually computationally heavy, which hinders their use on resource-constrained Internet of Things (IoT) end devices. To address this, cooperative deep inference has been proposed, in which a DNN model is adaptively partitioned into two parts that are executed on different devices (the cloud or edge end devices) to minimize the total inference latency. An important issue is thus to find the optimal partition of the deep model in real time, subject to network dynamics. In this paper, we formulate the optimal DNN partition as a min-cut problem in a directed acyclic graph (DAG) specially derived from the DNN, and propose a novel two-stage approach named quick deep model partition (QDMP) to solve it. QDMP exploits the fact that the optimal partition of a DNN model must lie between two adjacent cut vertices in the corresponding DAG. It first identifies these two cut vertices and then considers only the subgraph between them when calculating the min-cut. QDMP finds the optimal model partition with a response time of less than 300 ms even for large DNN models containing hundreds of layers (up to 66.3x faster than the state-of-the-art solution), and thus enables real-time cooperative deep inference over the cloud and edge end devices. Moreover, we observe an important fact that is ignored in all previous works: because many deep learning frameworks optimize the execution of DNN models, the execution latency of a series of layers in a DNN does not equal the sum of each layer's independent execution latency. This results in inaccurate inference latency estimation in existing works. We propose a new execution latency measurement method, with which the inference latency can be accurately estimated in practice.
We implement QDMP on real hardware and use a real-world self-driving car video dataset to evaluate its performance. Experimental results show that QDMP outperforms the state-of-the-art solution, reducing inference latency by up to 1.69x and increasing throughput by up to 3.81x.
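The min-cut formulation mentioned in the abstract can be illustrated with a small sketch. This is not the QDMP implementation: it is a minimal example, assuming a commonly used graph construction (one auxiliary vertex per layer so that each layer's output is transmitted at most once) and a plain Edmonds-Karp max-flow over the whole graph rather than QDMP's two-stage search between cut vertices. The function name `min_cut_partition`, the toy layer names, and all latency numbers are hypothetical, and the cost of returning the result to the device is ignored.

```python
from collections import deque

INF = float("inf")

def min_cut_partition(layers, edges):
    """layers: {name: (edge_ms, cloud_ms, tx_ms)}; edges: list of (u, v) DAG edges.
    Returns (total_latency, set of layers kept on the edge device)."""
    cap = {}  # cap[u][v] = residual capacity of arc u -> v

    def add(u, v, c):
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)  # reverse arc for the residual graph

    S, T = "__edge__", "__cloud__"
    for name, (edge_ms, cloud_ms, tx_ms) in layers.items():
        add(S, name, cloud_ms)            # cut when the layer runs on the cloud
        add(name, T, edge_ms)             # cut when the layer runs on the edge
        add(name, ("out", name), tx_ms)   # cut once if any successor is on the cloud
    for u, v in edges:
        add(("out", u), v, INF)           # successors share u's single transmission
        add(v, u, INF)                    # forbid data flowing back from cloud to edge

    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path (Edmonds-Karp).
        parent = {S: None}
        queue = deque([S])
        while queue and T not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if T not in parent:
            break
        # Find the bottleneck capacity along the path, then push flow.
        f, v = INF, T
        while parent[v] is not None:
            f = min(f, cap[parent[v]][v])
            v = parent[v]
        v = T
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= f
            cap[v][u] += f
            v = u
        flow += f

    # Layers still reachable from the source lie on the edge side of the cut.
    on_edge = {name for name in layers if name in parent}
    return flow, on_edge

# Toy DAG with one branch; (edge_ms, cloud_ms, tx_ms) values are made up.
LAYERS = {
    "input":  (0, INF, 60),   # raw input must start on the edge device
    "conv1":  (5, 2, 50),
    "conv2a": (8, 3, 10),
    "conv2b": (6, 3, 10),
    "add":    (2, 1, 4),
    "fc":     (30, 2, 1),
}
EDGES = [("input", "conv1"), ("conv1", "conv2a"), ("conv1", "conv2b"),
         ("conv2a", "add"), ("conv2b", "add"), ("add", "fc")]

if __name__ == "__main__":
    total, on_edge = min_cut_partition(LAYERS, EDGES)
    print(total, sorted(on_edge))
```

In this toy graph the cheapest split keeps everything up to `add` on the edge device and offloads only `fc`, because the output of `add` is small to transmit while `fc` is expensive to run locally. Restricting the search to the subgraph between two adjacent cut vertices, as QDMP does, would shrink the graph passed to the max-flow step.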

DNN Real-Time Collaborative Inference Acceleration with Mobile Edge Computing

Collaborative DNNs Inference with Joint Model Partition and Compression in Mobile Edge-Cloud Computing Networks

Extendable Multi-Device Collaborative Pipeline Parallel Inference in the Edge-Cloud Scenario

Efficient Partitioning and Communication Scheme-Based Distributed Edge Computing to Accelerate Deep Neural Network

An Adaptive DNN Inference Acceleration Framework with End–edge–cloud Collaborative Computing

Multi-Compression Scale DNN Inference Acceleration based on Cloud-Edge-End Collaboration

Distributed DNN Inference with Fine-grained Model Partitioning in Mobile Edge Computing Networks

Collaborative Inference for Deep Neural Networks in Edge Environments

Towards Real-time Cooperative Deep Inference over the Cloud and Edge End Devices

Accelerating DNN Inference by Edge-Cloud Collaboration

Dynamic DNN Decomposition for Lossless Synergistic Inference

Real-time Adaptive Partition and Resource Allocation for Multi-user End-cloud Inference Collaboration in Mobile Environment

Collaborative Inference Acceleration Integrating DNN Partitioning and Task Offloading in Mobile Edge Computing

On-Demand Deep Model Compression for Mobile Devices

HSMS-ADP: Adaptive DNNs Partitioning for End-Edge Collaborative Inference in High-Speed Mobile Scenarios

Adaptive Deep Inference Framework for Cloud-Edge Collaboration

DECC: Delay-Aware Edge-Cloud Collaboration for Accelerating DNN Inference

Model Parallelism Optimization for Distributed DNN Inference on Edge Devices

Collaborative Deep Neural Network Inference via Mobile Edge Computing

A Fine-Grained End-to-End Latency Optimization Framework for Wireless Collaborative Inference

Delay-Aware DNN Inference Throughput Maximization in Edge Computing Via Jointly Exploring Partitioning and Parallelism