Abstract: Deep neural networks (DNNs) have been widely used in many intelligent applications such as object recognition and autonomous driving due to their superior performance in inference tasks. However, DNN models are usually computationally heavyweight, which hinders their use on resource-constrained Internet of Things (IoT) end devices. To this end, cooperative deep inference has been proposed, in which a DNN model is adaptively partitioned into two parts that are executed on different devices (cloud or edge end devices) to minimize the total inference latency. An important issue is thus to find the optimal partition of the deep model in real time as network conditions change. In this paper, we formulate the optimal DNN partition as a min-cut problem in a directed acyclic graph (DAG) specially derived from the DNN and propose a novel two-stage approach named quick deep model partition (QDMP) to solve it. QDMP exploits the fact that the optimal partition of a DNN model must lie between two adjacent cut vertices in the corresponding DAG. It first identifies the two cut vertices and then considers only the subgraph between them when calculating the min-cut. QDMP can find the optimal model partition with a response time of less than 300 ms even for large DNN models containing hundreds of layers (up to 66.3x faster than the state-of-the-art solution), and thus enables real-time cooperative deep inference over the cloud and edge end devices. Moreover, we observe an important fact that is ignored in all previous works: because many deep learning frameworks optimize the execution of DNN models, the execution latency of a series of layers in a DNN does not equal the sum of the layers' independent execution latencies. This results in inaccurate inference latency estimation in existing works. We propose a new execution latency measurement method, with which the inference latency can be accurately estimated in practice.
We implement QDMP on real hardware and use a real-world self-driving car video dataset to evaluate its performance. Experimental results show that QDMP outperforms the state-of-the-art solution, reducing inference latency by up to 1.69x and increasing throughput by up to 3.81x.
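
To make the notion of a cut vertex concrete, the following is a minimal Python sketch, not the QDMP implementation. It assumes a DAG with a single input layer and a single output layer in which every layer lies on some input-to-output path, and it treats a cut vertex as a layer that every input-to-output path passes through. The function names (`topological_order`, `cut_vertices`) and the toy graph `residual_block` are hypothetical and chosen for illustration only.

```python
from collections import deque


def topological_order(succ):
    """Return the vertices of a DAG in topological order (Kahn's algorithm).

    `succ` maps every vertex to the list of its successors; the output
    layer must be present with an empty list.
    """
    indeg = {v: 0 for v in succ}
    for v in succ:
        for w in succ[v]:
            indeg[w] += 1
    queue = deque(v for v in succ if indeg[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    if len(order) != len(succ):
        raise ValueError("graph contains a cycle")
    return order


def cut_vertices(succ):
    """List the layers that every input-to-output path passes through.

    Assumption (not from the paper): one input, one output, and every
    vertex lies on some input-to-output path. Under that assumption a
    vertex at topological position i is a cut vertex exactly when no
    edge starts before position i and ends after it.
    """
    order = topological_order(succ)
    pos = {v: i for i, v in enumerate(order)}
    cuts = []
    furthest = 0  # furthest position reached by an edge from an earlier vertex
    for i, v in enumerate(order):
        if furthest <= i:  # no edge jumps over v
            cuts.append(v)
        for w in succ[v]:
            furthest = max(furthest, pos[w])
    return cuts


# Toy graph: a residual block whose skip connection bypasses conv2 and conv3.
residual_block = {
    "input": ["conv1"],
    "conv1": ["conv2", "add"],
    "conv2": ["conv3"],
    "conv3": ["add"],
    "add": ["fc"],
    "fc": [],
}

print(cut_vertices(residual_block))
```

In this toy graph the skip connection from `conv1` to `add` jumps over `conv2` and `conv3`, so those two layers are not cut vertices. Under the stated assumptions, a min-cut search restricted to the span between two adjacent cut vertices (here, between `conv1` and `add`) operates on a much smaller subgraph than the full DAG.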