Abstract: The last decade has seen the emergence of Deep Neural Networks (DNNs) as the de facto algorithm for various computer vision applications. In intelligent edge devices, sensor data streams acquired by the device are processed by a DNN application running either on the edge device itself or in the cloud. However, 'edge-only' and 'cloud-only' execution of state-of-the-art DNNs may not meet an application's latency requirements due to the limited compute, memory, and energy resources in edge devices, the dynamically varying bandwidth of edge-cloud connectivity networks, and temporal variations in the computational load of cloud servers. This work investigates distributed (partitioned) inference across edge devices (mobile/end devices) and cloud servers to minimize end-to-end DNN inference latency. We study the impact of temporally varying operating conditions and the underlying compute and communication architecture on the decision of whether to run the inference solely on the edge, entirely in the cloud, or by partitioning the DNN model execution between the two. Leveraging the insights gained from this study and the wide variation in the capabilities of various edge platforms that run DNN inference, we propose PArtNNer, a platform-agnostic adaptive DNN partitioning algorithm that finds the optimal partitioning point in DNNs to minimize inference latency. PArtNNer can adapt to dynamic variations in communication bandwidth and cloud server load without requiring pre-characterization of underlying platforms. Experimental results for six image classification and object detection DNNs on a set of five commercial off-the-shelf compute platforms and three communication standards indicate that PArtNNer improves end-to-end inference latency by 10.2× and 3.2× on average, and by up to 21.1× and 6.7×, compared to execution of the DNN entirely on the edge device or entirely on a cloud server, respectively. Compared to pre-characterization-based partitioning approaches, PArtNNer converges to the optimal partitioning point 17.6× faster.
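To make the edge-only / cloud-only / partitioned decision concrete, the following is a minimal sketch of a pre-characterization-style exhaustive search over partition points. It is not PArtNNer's algorithm, which the abstract describes as adaptive and free of pre-characterization; it only illustrates the latency trade-off being optimized. The function name `best_partition`, the assumption of a linear chain of layers, the per-layer latency tables, and the choice to ignore the cost of returning the final result are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def best_partition(
    edge_ms: Sequence[float],      # assumed per-layer latency on the edge device (ms)
    cloud_ms: Sequence[float],     # assumed per-layer latency on the cloud server (ms)
    out_bytes: Sequence[float],    # size of each layer's output (bytes)
    input_bytes: float,            # size of the raw input (bytes)
    bandwidth_bytes_per_ms: float  # current edge-cloud bandwidth (bytes per ms)
) -> Tuple[int, float]:
    """Return (k, latency_ms): layers [0, k) run on the edge, layers [k, n) in the cloud.

    k == 0 is cloud-only, k == n is edge-only, anything in between is partitioned.
    """
    n = len(edge_ms)
    best_k, best_latency = 0, float("inf")
    for k in range(n + 1):
        edge_time = sum(edge_ms[:k])
        cloud_time = sum(cloud_ms[k:])
        if k == n:
            transfer_time = 0.0  # edge-only: nothing is uploaded
        else:
            # Upload the raw input (k == 0) or the output of the last edge layer.
            payload = input_bytes if k == 0 else out_bytes[k - 1]
            transfer_time = payload / bandwidth_bytes_per_ms
        total = edge_time + transfer_time + cloud_time
        if total < best_latency:
            best_k, best_latency = k, total
    return best_k, best_latency


# Hypothetical three-layer model; all numbers are made up for illustration.
edge = [10.0, 20.0, 40.0]
cloud = [1.0, 2.0, 4.0]
outputs = [4000.0, 500.0, 10.0]

for bw in (1.0, 100.0, 1000.0):
    print(bw, best_partition(edge, cloud, outputs, 6000.0, bw))
```

With these made-up numbers, the chosen partition point moves from edge-only at low bandwidth, to a mid-network split, to cloud-only at high bandwidth, which is the kind of sensitivity to operating conditions that the abstract highlights. Rerunning such a search whenever bandwidth or server load changes requires up-to-date per-layer measurements for every platform, which is the pre-characterization cost the abstract says PArtNNer avoids.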

nn-METER: Towards Accurate Latency Prediction of DNN Inference on Diverse Edge Devices

nn-Meter: towards accurate latency prediction of deep-learning model inference on diverse edge devices

Accurate Deep Learning Inference Latency Prediction over Dynamic Running Mobile Devices

An Adaptive DNN Inference Acceleration Framework with End–edge–cloud Collaborative Computing

CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs

Towards A Flexible Accuracy-Oriented Deep Learning Module Inference Latency Prediction Framework for Adaptive Optimization Algorithms

Sectum: Accurate Latency Prediction for TEE-hosted Deep Learning Inference

A Fine-Grained End-to-End Latency Optimization Framework for Wireless Collaborative Inference

MNN: A Universal and Efficient Inference Engine

Accelerate Intermittent Deep Inference

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

PerfSAGE: Generalized Inference Performance Predictor for Arbitrary Deep Learning Models on Edge Devices

Edge AI: On-Demand Accelerating Deep Neural Network Inference via Edge Computing

Minimizing Latency for Multi-DNN Inference on Resource-Limited CPU-Only Edge Devices

Edge Intelligence: On-Demand Deep Learning Model Co-Inference with Device-Edge Synergy

Ace-Sniper: Cloud-Edge Collaborative Scheduling Framework With DNN Inference Latency Modeling on Heterogeneous Devices

On Latency Predictors for Neural Architecture Search

Dynamic DNNs and Runtime Management for Efficient Inference on Mobile/Embedded Devices

Runtime Performance Prediction for Deep Learning Models with Graph Neural Network

PArtNNer: Platform-agnostic Adaptive Edge-Cloud DNN Partitioning for minimizing End-to-End Latency