Abstract: Deep neural networks (DNNs) have been widely used in many intelligent applications, such as object recognition and autonomous driving, due to their superior performance in inference tasks. However, DNN models are usually computationally heavyweight, which hinders their use on resource-constrained Internet of Things (IoT) end devices. To this end, cooperative deep inference has been proposed, in which a DNN model is adaptively partitioned into two parts and the parts are executed on different devices (cloud or edge end devices) to minimize the total inference latency. An important issue is thus to find the optimal partition of the deep model in real time, subject to network dynamics. In this paper, we formulate the optimal DNN partition as a min-cut problem in a directed acyclic graph (DAG) specially derived from the DNN and propose a novel two-stage approach named quick deep model partition (QDMP) to solve it. QDMP exploits the fact that the optimal partition of a DNN model must lie between two adjacent cut vertices in the corresponding DAG. It first identifies the two cut vertices and then considers only the subgraph between them when calculating the min-cut. QDMP can find the optimal model partition with a response time of less than 300 ms even for large DNN models containing hundreds of layers (up to 66.3x faster than the state-of-the-art solution), and thus enables real-time cooperative deep inference over the cloud and edge end devices. Moreover, we observe an important fact that is ignored in all previous works: because many deep learning frameworks optimize the execution of DNN models, the execution latency of a series of layers in a DNN does not equal the sum of each layer's independent execution latency. This results in inaccurate inference latency estimation in existing works. We propose a new execution latency measurement method, with which the inference latency can be accurately estimated in practice.
We implement QDMP on real hardware and use a real-world self-driving car video dataset to evaluate its performance. Experimental results show that QDMP outperforms the state-of-the-art solution, reducing inference latency by up to 1.69x and increasing throughput by up to 3.81x.
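
The min-cut formulation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the QDMP algorithm itself: it skips the cut-vertex stage and runs a plain Edmonds-Karp max-flow over the whole graph. The function name `min_cut_partition`, the layer names, and all latency values are hypothetical, and the sketch assumes one transfer cost per link, so a layer with several successors would have its output counted more than once.

```python
from collections import deque

INF = float("inf")

def min_cut_partition(edge_time, cloud_time, links):
    """Return (total_latency, layers_on_device) for a toy DNN DAG.

    edge_time[v], cloud_time[v]: assumed per-layer latencies.
    links: {(u, v): transfer_time} for each data dependency u -> v.
    Layer names must not be "s" or "t" (reserved for source and sink).
    """
    cap = {}  # residual capacities: cap[u][v]

    def add(u, v, c):
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)  # reverse residual edge

    for v in edge_time:
        add("s", v, cloud_time[v])  # cut when v runs in the cloud
        add(v, "t", edge_time[v])   # cut when v runs on the device
    for (u, v), tx in links.items():
        add(u, v, tx)               # cut when u is on the device, v in the cloud
        add(v, u, INF)              # forbid sending data back from cloud to device

    def reachable():
        # BFS from the source over edges with remaining capacity.
        parent = {"s": None}
        queue = deque(["s"])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    total = 0
    while True:
        parent = reachable()
        if "t" not in parent:
            break
        # Walk back from the sink to collect the augmenting path.
        path, v = [], "t"
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= flow
            cap[v][u] += flow
        total += flow

    # Vertices still reachable from the source form the device-side set.
    return total, {v for v in reachable() if v != "s"}

# Toy chain input -> a -> b -> c; the input must stay on the device.
edge_time = {"input": 0, "a": 10, "b": 40, "c": 50}
cloud_time = {"input": INF, "a": 1, "b": 4, "c": 5}
links = {("input", "a"): 30, ("a", "b"): 5, ("b", "c"): 20}

latency, on_device = min_cut_partition(edge_time, cloud_time, links)
print(latency, sorted(on_device))  # 24 ['a', 'input']
```

In this toy chain the cheapest split runs layer `a` on the device, sends its 5-unit output to the cloud, and runs `b` and `c` there, for a total of 10 + 5 + 4 + 5 = 24.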

Towards Real-time Cooperative Deep Inference over the Cloud and Edge End Devices