Abstract: Deep neural networks (DNNs) have been widely used in many intelligent applications such as object recognition and autonomous driving due to their superior performance in inference tasks. However, DNN models are usually computationally heavyweight, which hinders their use on resource-constrained Internet of Things (IoT) end devices. To this end, cooperative deep inference has been proposed, in which a DNN model is adaptively partitioned into two parts that are executed on different devices (cloud or edge end devices) to minimize the total inference latency. One important issue is thus to find the optimal partition of the deep model in real time, subject to network dynamics. In this paper, we formulate the optimal DNN partition as a min-cut problem in a directed acyclic graph (DAG) specially derived from the DNN and propose a novel two-stage approach named quick deep model partition (QDMP) to solve it. QDMP exploits the fact that the optimal partition of a DNN model must lie between two adjacent cut vertices in the corresponding DAG. It first identifies the two cut vertices and then considers only the subgraph between them when calculating the min-cut. QDMP can find the optimal model partition with a response time of less than 300 ms even for large DNN models containing hundreds of layers (up to 66.3x faster than the state-of-the-art solution), and thus enables real-time cooperative deep inference over the cloud and edge end devices. Moreover, we observe one important fact that is ignored in all previous works: because many deep learning frameworks optimize the execution of DNN models, the execution latency of a series of layers in a DNN does not equal the sum of the layers' individual execution latencies. This results in inaccurate inference latency estimation in existing works. We propose a new execution latency measurement method, with which the inference latency can be accurately estimated in practice.
We implement QDMP on real hardware and use a real-world self-driving car video dataset to evaluate its performance. Experimental results show that QDMP outperforms the state-of-the-art solution, reducing inference latency by up to 1.69x and increasing throughput by up to 3.81x.
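
The min-cut view of model partitioning described in the abstract can be illustrated with a small sketch. This is not QDMP itself: it omits the cut-vertex stage and simply runs a textbook Edmonds-Karp max-flow on the whole graph. The graph construction (an auxiliary "out" node per layer so that each output is transmitted at most once, and infinite reverse arcs so that data never flows from cloud back to device) is one common way to build such a graph and is an assumption here, as are the function name, the layer names, and every timing value. The sketch also treats per-layer latencies as additive, which is exactly the simplification the abstract warns is inaccurate in practice.

```python
from collections import defaultdict, deque

INF = float("inf")

def min_cut_partition(layers, edges, edge_time, cloud_time, tx_time):
    """Return (latency, layers_on_device) for a DAG of layers.

    Source side of the cut = runs on the edge device; sink side = cloud.
    Assumes every edge_time value is finite so the max flow is finite.
    """
    cap = defaultdict(lambda: defaultdict(float))
    S, T = "_src", "_sink"
    for v in layers:
        cap[S][v] += cloud_time[v]        # cut when v is placed on the cloud
        cap[v][T] += edge_time[v]         # cut when v stays on the device
        cap[v][("out", v)] += tx_time[v]  # v's output is uploaded at most once
    for u, v in edges:
        cap[("out", u)][v] = INF          # a cloud successor forces the upload
        cap[v][u] = INF                   # forbid cloud -> device transfers
    # Give every arc a residual reverse arc.
    for u in list(cap):
        for v in list(cap[u]):
            cap[v].setdefault(u, 0.0)

    total = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {S: None}
        queue = deque([S])
        while queue and T not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if T not in parent:
            break  # parent now holds the source side of a minimum cut
        path, v = [], T
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push
        total += push
    return total, [v for v in layers if v in parent]

# Hypothetical six-node model with one branch; all times in milliseconds.
layers = ["input", "conv1", "branch_a", "branch_b", "concat", "fc"]
edges = [("input", "conv1"), ("conv1", "branch_a"), ("conv1", "branch_b"),
         ("branch_a", "concat"), ("branch_b", "concat"), ("concat", "fc")]
edge_time = {"input": 0, "conv1": 3, "branch_a": 4, "branch_b": 5,
             "concat": 1, "fc": 30}
# The input is pinned to the device by an infinite cloud cost.
cloud_time = {"input": INF, "conv1": 1, "branch_a": 1, "branch_b": 1,
              "concat": 1, "fc": 2}
tx_time = {"input": 20, "conv1": 10, "branch_a": 3, "branch_b": 3,
           "concat": 2, "fc": 0}

latency, on_device = min_cut_partition(layers, edges, edge_time,
                                       cloud_time, tx_time)
print(latency, on_device)
```

With these made-up numbers the cheapest cut costs 17 ms and keeps everything up to `concat` on the device, sending only the small `concat` output to the cloud for the expensive `fc` layer. Restricting the same computation to the subgraph between two adjacent cut vertices, as the abstract describes, would shrink the graph that the max-flow step has to process.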

NAIR: an Efficient Distributed Deep Learning Architecture for Resource Constrained IoT System

Efficient Partitioning and Communication Scheme-Based Distributed Edge Computing to Accelerate Deep Neural Network

Adaptive ResNet Architecture for Distributed Inference in Resource-Constrained IoT Systems

Efficient Deep Structure Learning for Resource-Limited IoT Devices

Joint Architecture Design and Workload Partitioning for DNN Inference on Industrial IoT Clusters

AdaInNet: an adaptive inference engine for distributed deep neural networks offloading in IoT-FOG applications based on reinforcement learning

Toward Secure and Efficient Deep Learning Inference in Dependable IoT Systems

Toward Decentralized and Collaborative Deep Learning Inference for Intelligent IoT Devices

Communication-Efficient Separable Neural Network for Distributed Inference on Edge Devices

Deploy Large-Scale Deep Neural Networks in Resource Constrained IoT Devices with Local Quantization Region

DeepThings: Distributed Adaptive Deep Learning Inference on Resource-Constrained IoT Edge Clusters

An efficient pruning scheme of deep neural networks for Internet of Things applications

Low Latency Deep Learning Inference Model for Distributed Intelligent IoT Edge Clusters

Resource-Efficient Distributed Deep Neural Networks Empowered by Intelligent Software-Defined Networking

A Bi-Directional Co-Design Approach to Enable Deep Learning on IoT Devices

Memory- and Communication-Aware Model Compression for Distributed Deep Learning Inference on IoT

Towards Real-time Cooperative Deep Inference over the Cloud and Edge End Devices

User-Distribution-Aware Federated Learning for Efficient Communication and Fast Inference

RL-DistPrivacy: Privacy-Aware Distributed Deep Inference for low latency IoT systems

A Lite Distributed Semantic Communication System for Internet of Things

Real-time Adaptive Partition and Resource Allocation for Multi-user End-cloud Inference Collaboration in Mobile Environment