Abstract: International Journal of Software Engineering and Knowledge Engineering, Ahead of Print. In mobile edge computing environments, intelligent inference services driven by Deep Neural Networks (DNNs) are highly sensitive to latency. Recently, collaborative inference between User Devices and Edge Servers (ESs) based on DNN partitioning has succeeded in accelerating such services. However, most existing collaborative acceleration schemes partition a single DNN inference task; they cannot quickly make partition decisions for a set of concurrent inference tasks, and they often sacrifice inference accuracy. In addition, because ES resources are limited, concurrent requests compete for them, so partitioned tasks cannot always be offloaded to ESs in time for processing. Designing an efficient offloading scheme is therefore essential. Task offloading schemes based on deep reinforcement learning can solve complex decision-making problems in high-dimensional state spaces, but they suffer from insufficient sample diversity and easily fall into local optima. This paper proposes a Collaborative Inference Acceleration Scheme integrating DNN Partitioning and Task Offloading (CIAS-PnO). First, the Collaborative DNN Layer Partitioning (CDLP) algorithm is designed to minimize latency while preserving inference accuracy. CDLP uses a pruning operation to reduce the problem scale of partitioning concurrent inference tasks, so that partition decisions can be made in time. Then, the Distributed Soft Actor-Critic (SAC)-based Partition Task Offloading algorithm (DSACO) is designed. DSACO lets SAC agents explore samples in parallel and share learning experiences, and it uses an automatic entropy adjustment mechanism to improve the agents' exploration efficiency, which helps avoid local optima and achieves efficient offloading of partitioned tasks.
Experimental results on DNN benchmarks show that, compared with the baseline acceleration schemes, CIAS-PnO improves acceleration performance by more than 19.8% and achieves better convergence and a higher task success rate.
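
To make the idea of latency-driven layer partitioning concrete, the following is a minimal sketch for a single chain-structured DNN. It is not the CDLP algorithm from the paper, whose details the abstract does not give. The function name best_partition, the per-layer timing lists, and the simple early-exit pruning rule are all assumptions introduced here for illustration. The sketch scans the candidate cut points, estimates end-to-end latency as device time plus transfer time plus edge time, and stops once the device-side time alone can no longer beat the best cut found so far.

```python
from typing import List, Tuple


def best_partition(
    device_ms: List[float],      # hypothetical per-layer latency on the user device
    edge_ms: List[float],        # hypothetical per-layer latency on the edge server
    out_bytes: List[float],      # output size of each layer, in bytes
    input_bytes: float,          # size of the raw model input, in bytes
    bandwidth_bytes_per_ms: float,
) -> Tuple[int, float]:
    """Return (k, latency): layers [0, k) run on the device, layers [k, n) on the edge.

    k == n means the whole model runs on the device and nothing is transferred.
    """
    n = len(device_ms)
    # Start from the all-on-device option, which needs no transfer.
    best_k, best_latency = n, sum(device_ms)

    device_prefix = 0.0          # time spent on device layers before the cut
    edge_suffix = sum(edge_ms)   # time spent on edge layers after the cut
    for k in range(n):
        # Pruning (illustrative assumption): device time only grows with k,
        # so once it reaches the best latency no later cut can improve on it.
        if device_prefix >= best_latency:
            break
        # Data sent at cut k: the raw input for k == 0, else the output of layer k-1.
        sent = input_bytes if k == 0 else out_bytes[k - 1]
        latency = device_prefix + sent / bandwidth_bytes_per_ms + edge_suffix
        if latency < best_latency:
            best_k, best_latency = k, latency
        device_prefix += device_ms[k]
        edge_suffix -= edge_ms[k]
    return best_k, best_latency


if __name__ == "__main__":
    k, latency = best_partition(
        device_ms=[10, 20, 30],
        edge_ms=[1, 2, 3],
        out_bytes=[1000, 100, 10],
        input_bytes=5000,
        bandwidth_bytes_per_ms=100,
    )
    print(k, latency)
```

With the made-up numbers in the usage example, the cut after the first layer wins because the first layer shrinks the data to be transmitted while costing little device time. A scheme for concurrent tasks, as the abstract describes, would additionally have to account for contention on the edge server, which this single-task sketch ignores.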

Delay-Aware DNN Inference Throughput Maximization in Edge Computing Via Jointly Exploring Partitioning and Parallelism

Throughput Maximization of Delay-Aware DNN Inference in Edge Computing by Exploring DNN Model Partitioning and Inference Parallelism

Efficient Partitioning and Communication Scheme-Based Distributed Edge Computing to Accelerate Deep Neural Network

Extendable Multi-Device Collaborative Pipeline Parallel Inference in the Edge-Cloud Scenario

Accelerating DNN Inference by Edge-Cloud Collaboration

Distributed DNN Inference with Fine-grained Model Partitioning in Mobile Edge Computing Networks

DECC: Delay-Aware Edge-Cloud Collaboration for Accelerating DNN Inference

Joint DNN Partition Deployment and Resource Allocation for Delay-Sensitive Deep Learning Inference in IoT

Joint DNN partitioning and resource allocation for completion rate maximization of delay-aware DNN inference tasks in wireless powered mobile edge computing

Deep Neural Network Task Partitioning and Offloading for Mobile Edge Computing

DNN Inference Acceleration with Partitioning and Early Exiting in Edge Computing

Collaborative Deep Neural Network Inference via Mobile Edge Computing

Towards Real-Time Inference Offloading with Distributed Edge Computing: the Framework and Algorithms

An Adaptive DNN Inference Acceleration Framework with End–edge–cloud Collaborative Computing

End-to-End Delay Minimization based on Joint Optimization of DNN Partitioning and Resource Allocation for Cooperative Edge Inference

Enabling Latency-Sensitive DNN Inference Via Joint Optimization of Model Surgery and Resource Allocation in Heterogeneous Edge

Hastening Stream Offloading of Inference Via Multi-Exit DNNs in Mobile Edge Computing

Accelerating Deep Learning Inference via Model Parallelism and Partial Computation Offloading

Collaborative Inference Acceleration Integrating DNN Partitioning and Task Offloading in Mobile Edge Computing

Joint multi-user DNN partitioning and task offloading in mobile edge computing

Multi-Compression Scale DNN Inference Acceleration based on Cloud-Edge-End Collaboration