Abstract: Edge inference has become more widespread, as its diverse applications range from retail to wearable technology. Clusters of networked resource-constrained edge devices are becoming common, yet no system exists to split a DNN across these clusters while maximizing the inference throughput of the system. Additionally, no production-ready orchestration system exists for deploying said models over such edge networks which adopts the robustness and scalability of the cloud. We present an algorithm which partitions DNNs and distributes them across a set of edge devices with the goal of minimizing the bottleneck latency and therefore maximizing inference throughput. The system scales well to systems of different node memory capacities and numbers of nodes, while being node fault-tolerant. We find that we can reduce the bottleneck latency by 10x over a random algorithm and 35% over a greedy joint partitioning-placement algorithm, although the joint-partitioning algorithm outperforms our algorithm in most practical use-cases. Furthermore, we find empirically that for the set of representative models we tested, the algorithm produces results within 9.2% of the optimal bottleneck latency. We then developed a standalone cluster network emulator on which we tested configurations of up to 20 nodes and found a steady increase in throughput and decrease in end-to-end latency as the cluster size scales. In these tests, we observed that our system has multi-node fault-tolerance as well as network and system IO fault-tolerance. We have implemented our framework in open-source software that is publicly available to the research community at https://github.com/ANRGUSC/SEIFER.

What problem does this paper attempt to address?

This paper addresses how to deploy and run deep neural networks (DNNs) efficiently on a cluster of resource-constrained edge devices so as to maximize the system's inference throughput. It focuses on two main issues:

1. **How to run DNN inference efficiently on a multi-device edge cluster**: The paper proposes an algorithm that splits a DNN into multiple partitions and assigns them to different edge devices, forming an inference pipeline. Even if each node has little computing power, pipelining can significantly raise the throughput of the whole system. The goal is to maximize inference throughput by minimizing the bottleneck delay, defined as

   \[ \beta = \max_{k \in [K]} \gamma_k \]

   where \(\gamma_k\) is the communication time between the \(k\)-th node and the next node.

2. **How to integrate the principles of cloud computing**: The paper also explores how to bring cloud-computing properties such as high availability and fault tolerance to edge inference, making it more feasible in production. Specifically, it proposes a robust, container-based system that recovers automatically from node or network failures and can dynamically adjust the model partitions when the model is updated.

### Main contributions

1. **Partitioning and placement algorithm**: The paper proposes a partitioning and placement algorithm for DNNs that chooses partition points and assigns the resulting partitions to the nodes with the highest bandwidth, in order to minimize the bottleneck delay.
2. **Robust containerized system**: The paper develops a container-based system that tolerates node faults and supports dynamic updates of model partitions. The system accounts for the resource limits of edge devices and provides a lightweight inference runtime.
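To make the bottleneck-delay objective concrete, here is a minimal Python sketch. It is not the paper's algorithm: it assumes a simple chain of nodes, takes \(\gamma_k\) to be the bytes sent over link \(k\) divided by that link's bandwidth, ignores node memory limits, and searches cut points exhaustively. The function names and the numbers in the example are hypothetical.

```python
from itertools import combinations


def bottleneck_delay(transfer_bytes, link_bandwidths):
    """Return beta = max_k gamma_k, where gamma_k is assumed to be
    (bytes sent over link k) / (bandwidth of link k in bytes per second)."""
    if len(transfer_bytes) != len(link_bandwidths):
        raise ValueError("need exactly one transfer size per link")
    return max(size / bw for size, bw in zip(transfer_bytes, link_bandwidths))


def best_cut_points(boundary_bytes, link_bandwidths):
    """Exhaustively pick one cut point per link, in layer order, so that
    the bottleneck delay is as small as possible.

    boundary_bytes[i] is the (hypothetical) size of the tensor that would
    cross the network if the model were cut after layer i.
    Returns (beta, cuts), where cuts is a tuple of layer indices.
    """
    n_cuts = len(link_bandwidths)
    best = None
    for cuts in combinations(range(len(boundary_bytes)), n_cuts):
        sizes = [boundary_bytes[i] for i in cuts]
        beta = bottleneck_delay(sizes, link_bandwidths)
        if best is None or beta < best[0]:
            best = (beta, cuts)
    return best


# Hypothetical example: four candidate cut points, three nodes in a chain
# (hence two links), bandwidths in bytes per second.
boundary_bytes = [800_000, 200_000, 400_000, 100_000]
link_bandwidths = [1_000_000, 500_000]

beta, cuts = best_cut_points(boundary_bytes, link_bandwidths)
print("cut after layers:", cuts)
print("bottleneck delay (s):", beta)
print("pipeline throughput bound (inferences/s):", 1 / beta)
```

Because a pipeline can emit at most one result per bottleneck interval, lowering \(\beta\) raises the achievable throughput, which is why the summary treats the two goals as equivalent. The exhaustive search is only practical for small models; it is used here solely to keep the illustration short.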
### Experimental results

- Compared with a random partitioning/placement algorithm, the algorithm reduces the bottleneck delay by 10x; compared with a greedy joint partitioning/placement algorithm, it reduces it by 35%, although the abstract notes that the joint algorithm outperforms it in most practical use cases.
- For the representative models tested, the bottleneck delay produced by the algorithm is within 9.2% of the optimal value.
- The containerized system has multi-node fault tolerance and can recover from network and system IO failures.

### Related work

- **DNN model splitting**: Some studies split the DNN model mathematically, based on how layers' computations affect one another, but do not consider the communication requirements of edge devices. Other methods abstract model layers into "execution units" and split them according to resource requirements. These methods mainly optimize hybrid edge-cloud pipelines and are not suited to clusters of edge devices.
- **Edge inference runtime**: Some frameworks optimize energy use on edge devices, but focus mainly on model compression and pruning. Other frameworks focus on task scheduling across geographically distributed computing nodes, but do not consider the bandwidth limitations within edge device clusters.

### Conclusion

By proposing a DNN partitioning and placement algorithm together with a robust containerized system, the paper addresses the problem of efficiently deploying and running DNNs on a cluster of resource-constrained edge devices, and offers technical groundwork for practical edge computing applications.

Partitioning and Deployment of Deep Neural Networks on Edge Clusters
