Partitioning and Deployment of Deep Neural Networks on Edge Clusters

Arjun Parthasarathy, Bhaskar Krishnamachari
2023-04-24
Abstract: Edge inference has become more widespread, as its diverse applications range from retail to wearable technology. Clusters of networked resource-constrained edge devices are becoming common, yet no system exists to split a DNN across these clusters while maximizing the inference throughput of the system. Additionally, no production-ready orchestration system exists for deploying said models over such edge networks which adopts the robustness and scalability of the cloud. We present an algorithm which partitions DNNs and distributes them across a set of edge devices with the goal of minimizing the bottleneck latency and therefore maximizing inference throughput. The system scales well to systems of different node memory capacities and numbers of nodes, while being node fault-tolerant. We find that we can reduce the bottleneck latency by 10x over a random algorithm and 35% over a greedy joint partitioning-placement algorithm, although the joint-partitioning algorithm outperforms our algorithm in most practical use-cases. Furthermore, we find empirically that for the set of representative models we tested, the algorithm produces results within 9.2% of the optimal bottleneck latency. We then developed a standalone cluster network emulator on which we tested configurations of up to 20 nodes and found a steady increase in throughput and decrease in end-to-end latency as the cluster size scales. In these tests, we observed that our system has multi-node fault-tolerance as well as network and system IO fault-tolerance. We have implemented our framework in open-source software that is publicly available to the research community at https://github.com/ANRGUSC/SEIFER.
Networking and Internet Architecture
What problem does this paper attempt to address?
This paper addresses how to deploy and run deep neural networks (DNNs) efficiently on a cluster of resource-constrained edge devices so as to maximize the system's inference throughput. It focuses on two main issues:

1. **How to achieve efficient DNN inference on a multi-device edge cluster**: The paper proposes an algorithm that splits the DNN into multiple partitions and assigns them to different edge devices, forming an inference pipeline. Even when each node has little computing power, pipelining can significantly improve the throughput of the whole system. The goal is to maximize inference throughput by minimizing the bottleneck delay, defined as

   \[ \beta = \max_{k \in [K]} \gamma_k \]

   where \(\gamma_k\) is the communication time between the \(k\)-th node and the next node in the pipeline.

2. **How to integrate the principles of cloud computing**: The paper also explores how to bring cloud-computing properties such as high availability and fault tolerance to edge inference, making it more practical in production. Specifically, it proposes a robust, container-based system that recovers automatically from node or network failures and repartitions the model when the model is updated.

### Main contributions

1. **Partitioning and placement algorithm**: The paper proposes a partitioning and placement algorithm for DNNs that chooses partition points and assigns the resulting partitions to nodes joined by the highest-bandwidth links, in order to minimize the bottleneck delay.
2. **Robust containerized system**: The paper develops a container-based system that tolerates node failures and supports dynamic updates of the model partitions. The system accounts for the devices' resource limits and provides a lightweight inference runtime.
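The bottleneck-delay objective defined above can be illustrated with a small sketch. It is not the paper's algorithm: it assumes that \(\gamma_k\) is simply the bytes sent on hop \(k\) divided by that hop's bandwidth, that any link can be assigned to any hop (ignoring the cluster topology and node memory limits), and that steady-state throughput is roughly \(1/\beta\). The function names and the example numbers are hypothetical.

```python
def bottleneck_latency(transfer_bytes, link_bandwidths):
    """Return beta = max_k gamma_k, assuming gamma_k = bytes on hop k / bandwidth of hop k."""
    if len(transfer_bytes) != len(link_bandwidths):
        raise ValueError("need exactly one link bandwidth per hop")
    return max(size / bw for size, bw in zip(transfer_bytes, link_bandwidths))


def match_largest_to_fastest(transfer_bytes, link_bandwidths):
    """Illustrative heuristic: give the largest transfer the fastest link.

    Assumes links can be assigned to hops freely, which a real cluster
    topology generally does not allow.
    """
    hops_by_size = sorted(range(len(transfer_bytes)),
                          key=lambda k: transfer_bytes[k], reverse=True)
    fastest_first = sorted(link_bandwidths, reverse=True)
    assigned = [0] * len(transfer_bytes)
    for k, bw in zip(hops_by_size, fastest_first):
        assigned[k] = bw
    return assigned


# Hypothetical pipeline: bytes sent between consecutive partitions,
# and available link bandwidths in bytes per second.
transfer_bytes = [4_000_000, 1_000_000, 2_000_000]
link_bandwidths = [1_000_000, 4_000_000, 2_000_000]

naive_beta = bottleneck_latency(transfer_bytes, link_bandwidths)
matched = match_largest_to_fastest(transfer_bytes, link_bandwidths)
matched_beta = bottleneck_latency(transfer_bytes, matched)

print(naive_beta)        # 4.0 seconds on the slowest hop
print(matched_beta)      # 1.0 second after re-matching links to hops
print(1 / matched_beta)  # approximate inferences per second, assuming throughput ~ 1/beta
```

In this toy case, re-matching links to hops cuts the bottleneck from 4.0 s to 1.0 s without changing the partitions, which is the intuition behind placing the largest inter-partition transfers on the highest-bandwidth links.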
### Experimental results

- Compared with a random partitioning/placement algorithm, the proposed algorithm reduces the bottleneck delay by 10x; compared with a greedy joint partitioning-placement algorithm, it reduces it by 35%. The abstract notes, however, that the joint-partitioning algorithm outperforms the proposed algorithm in most practical use cases.
- For the representative models tested, the algorithm's bottleneck delay is within 9.2% of the optimal value.
- The containerized system tolerates multi-node failures and recovers from network and system IO failures.

### Related work

- **DNN model splitting**: Some studies split the DNN model mathematically, based on how computation in one layer influences the next, but do not consider the communication requirements of edge devices. Other methods abstract the model layers into "execution units" and split them according to resource requirements. These methods mainly optimize hybrid edge-cloud pipelines and are not suited to clusters of edge devices.
- **Edge inference runtime**: Some frameworks optimize energy use on edge devices but focus mainly on model compression and pruning. Others focus on task scheduling across geographically distributed computing nodes but do not consider the bandwidth limits within edge device clusters.

### Conclusion

By proposing a DNN partitioning and placement algorithm together with a robust containerized system, the paper addresses the problem of efficiently deploying and running DNNs on a cluster of resource-constrained edge devices, providing technical support for practical edge-computing applications.