Abstract: International Journal of Software Engineering and Knowledge Engineering, Ahead of Print. In mobile edge computing environments, intelligent inference services driven by Deep Neural Networks (DNNs) are highly sensitive to latency. Recently, collaborative inference between User Devices and Edge Servers (ESs) based on DNN partitioning has succeeded in accelerating such services. However, most existing collaborative acceleration schemes partition a single DNN inference task; they cannot quickly make partition decisions for a set of concurrent inference tasks, and they often sacrifice inference accuracy. In addition, because ES resources are limited, concurrent requests compete for them, so partitioned tasks cannot always be offloaded to ESs in time for processing. Designing an efficient offloading scheme is therefore essential. Task offloading schemes based on deep reinforcement learning can solve complex decision-making problems in high-dimensional state spaces, but they suffer from insufficient sample diversity and tend to fall into local optima. This paper proposes a Collaborative Inference Acceleration Scheme integrating DNN Partitioning and Task Offloading (CIAS-PnO). First, the Collaborative DNN Layer Partitioning (CDLP) algorithm is designed to minimize latency while preserving inference accuracy. CDLP uses a pruning operation to reduce the scale of the partitioning problem for concurrent inference tasks, so that partition decisions can be made in time. Then, the Distributed Soft Actor-Critic (SAC)-based Partition Task Offloading algorithm (DSACO) is designed. DSACO lets SAC agents explore samples in parallel and share learning experiences, and it uses an automatic entropy adjustment mechanism to improve the agents' exploration efficiency, thereby avoiding local optima and achieving efficient offloading of partitioned tasks. Experimental results on DNN benchmarks show that, compared with baseline acceleration schemes, CIAS-PnO improves acceleration performance by more than 19.8% and achieves better convergence and a higher task success rate.
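
To make the idea of layer-wise DNN partitioning concrete, the following is a minimal sketch of choosing a single cut point for one inference task so that end-to-end latency is lowest. It is not the CDLP algorithm from the paper: it handles neither concurrent tasks, nor pruning, nor accuracy constraints. The function name `best_partition`, its parameters, and the latency model (device time for the first k layers, plus upload time for the data crossing the cut, plus server time for the remaining layers) are all assumptions made for illustration.

```python
from typing import List, Tuple


def best_partition(
    device_ms: List[float],
    server_ms: List[float],
    transfer_kb: List[float],
    bandwidth_kb_per_ms: float,
) -> Tuple[int, float]:
    """Return (k, latency_ms) for the cut point with the lowest latency.

    Hypothetical model: layers [0, k) run on the user device, layers
    [k, n) run on the edge server. transfer_kb[k] is the size of the
    data that must be uploaded when the cut is placed before layer k
    (transfer_kb[0] is the raw input). k == n means fully local
    inference, so nothing is uploaded.
    """
    n = len(device_ms)
    if len(server_ms) != n or len(transfer_kb) != n:
        raise ValueError("all per-layer lists must have the same length")
    if bandwidth_kb_per_ms <= 0:
        raise ValueError("bandwidth must be positive")

    best_k, best_latency = 0, float("inf")
    device_prefix = 0.0            # time for layers [0, k) on the device
    server_suffix = sum(server_ms)  # time for layers [k, n) on the server

    for k in range(n + 1):
        upload = transfer_kb[k] / bandwidth_kb_per_ms if k < n else 0.0
        total = device_prefix + upload + server_suffix
        if total < best_latency:
            best_k, best_latency = k, total
        if k < n:
            # Move layer k from the server side to the device side.
            device_prefix += device_ms[k]
            server_suffix -= server_ms[k]

    return best_k, best_latency


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the paper.
    k, latency = best_partition(
        device_ms=[4.0, 6.0, 8.0],
        server_ms=[1.0, 1.5, 2.0],
        transfer_kb=[600.0, 100.0, 50.0],
        bandwidth_kb_per_ms=50.0,
    )
    print(k, latency)
```

With these made-up numbers, cutting after the first layer wins because the first layer shrinks the data sharply before upload. A scheme for concurrent tasks, as described in the abstract, would additionally have to account for competition over ES resources, which this single-task sketch ignores.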

On-demand Edge Inference Scheduling with Accuracy and Deadline Guarantee.

Dynamic Batching and Early-Exiting for Accurate and Timely Edge Inference

Adaptive Scheduling for Edge-Assisted DNN Serving

BCEdge: SLO-Aware DNN Inference Services with Adaptive Batching on Edge Platforms

A DNN inference acceleration algorithm combining model partition and task allocation in heterogeneous edge computing system

BCEdge: SLO-Aware DNN Inference Services With Adaptive Batch-Concurrent Scheduling on Edge Devices

Distributed Assignment With Load Balancing for DNN Inference at the Edge

Online Approximation Scheme for Scheduling Heterogeneous Utility Jobs in Edge Computing

Load scheduling for distributed edge computing: A communication-computation tradeoff

When CPN Meets AI: Resource Provisioning for Inference Query Upon Computing Power Network

Provisioning Edge Inference As a Service Via Online Learning.

Kalmia: A Heterogeneous QoS-aware Scheduling Framework for DNN Tasks on Edge Servers

Integrated Quality of Service for Offline and Online Services in Edge Networks via Task Offloading and Service Caching

Online Deadline-Aware Task Dispatching and Scheduling in Edge Computing

Resource Allocation for Multiuser Edge Inference with Batching and Early Exiting (Extended Version)

Deep Learning-Assisted Online Task Offloading for Latency Minimization in Heterogeneous Mobile Edge

An Online Approach for DNN Model Caching and Processor Allocation in Edge Computing

Joint Task Offloading and Resource Allocation for Quality-Aware Edge-Assisted Machine Learning Task Inference

Scheduling Inference Workloads on Distributed Edge Clusters with Reinforcement Learning

Collaborative Inference Acceleration Integrating DNN Partitioning and Task Offloading in Mobile Edge Computing

Energy-Efficient DNN Partitioning and Offloading for Task Completion Rate Maximization in Multiuser Edge Intelligence