Abstract: The last decade has seen the emergence of Deep Neural Networks (DNNs) as the de facto algorithm for various computer vision applications. In intelligent edge devices, sensor data streams acquired by the device are processed by a DNN application running either on the edge device itself or in the cloud. However, ‘edge-only’ and ‘cloud-only’ execution of state-of-the-art DNNs may not meet an application’s latency requirements due to the limited compute, memory, and energy resources in edge devices, the dynamically varying bandwidth of edge-cloud connectivity networks, and temporal variations in the computational load of cloud servers. This work investigates distributed (partitioned) inference across edge devices (mobile/end devices) and cloud servers to minimize end-to-end DNN inference latency. We study the impact of temporally varying operating conditions and the underlying compute and communication architecture on the decision of whether to run the inference solely on the edge, entirely in the cloud, or by partitioning the DNN model execution between the two. Leveraging the insights gained from this study and the wide variation in the capabilities of various edge platforms that run DNN inference, we propose PArtNNer, a platform-agnostic adaptive DNN partitioning algorithm that finds the optimal partitioning point in DNNs to minimize inference latency. PArtNNer can adapt to dynamic variations in communication bandwidth and cloud server load without requiring pre-characterization of underlying platforms. Experimental results for six image classification and object detection DNNs on a set of five commercial off-the-shelf compute platforms and three communication standards indicate that PArtNNer improves end-to-end inference latency by 10.2× and 3.2× on average, and by up to 21.1× and 6.7×, compared to execution of the DNN entirely on the edge device or entirely on a cloud server, respectively. Compared to pre-characterization-based partitioning approaches, PArtNNer converges to the optimal partitioning point 17.6× faster.
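
To make the partitioning decision concrete, the following is a minimal sketch of the generic latency model behind edge-cloud partitioning: end-to-end latency is taken as edge compute time plus transfer time plus cloud compute time, and the partition point is chosen by exhaustive search. It is not the PArtNNer algorithm, which adapts online without pre-characterization; the function name `best_partition`, the per-layer timings, and the tensor sizes are hypothetical values assumed for illustration only.

```python
from typing import Sequence, Tuple


def best_partition(
    edge_ms: Sequence[float],      # assumed per-layer latency on the edge device
    cloud_ms: Sequence[float],     # assumed per-layer latency on the cloud server
    out_bytes: Sequence[float],    # assumed output size of each layer
    input_bytes: float,            # assumed size of the raw input
    bandwidth_bytes_per_ms: float,
) -> Tuple[int, float]:
    """Return (k, latency_ms): layers [0, k) run on the edge, the rest in the cloud."""
    n = len(edge_ms)
    best_k, best_latency = 0, float("inf")
    for k in range(n + 1):
        # Data crossing the link: the raw input for cloud-only (k == 0),
        # nothing for edge-only (k == n), otherwise the output of layer k-1.
        if k == 0:
            sent = input_bytes
        elif k == n:
            sent = 0.0
        else:
            sent = out_bytes[k - 1]
        latency = sum(edge_ms[:k]) + sent / bandwidth_bytes_per_ms + sum(cloud_ms[k:])
        if latency < best_latency:
            best_k, best_latency = k, latency
    return best_k, best_latency


# Hypothetical three-layer network; all numbers are made up for illustration.
edge = [10.0, 10.0, 40.0]
cloud = [1.0, 1.0, 4.0]
sizes = [1000.0, 100.0, 10.0]

for bw in (0.1, 10.0, 10000.0):
    print(bw, best_partition(edge, cloud, sizes, 5000.0, bw))
```

With these assumed numbers, the chosen partition point moves from edge-only at very low bandwidth, to a mid-network split at moderate bandwidth, to cloud-only at high bandwidth, which illustrates why the best choice depends on operating conditions.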

PArtNNer: Platform-agnostic Adaptive Edge-Cloud DNN Partitioning for minimizing End-to-End Latency