Abstract: International Journal of Software Engineering and Knowledge Engineering, Ahead of Print. In mobile edge computing environments, intelligent inference services driven by Deep Neural Networks (DNNs) are highly sensitive to latency. Recently, collaborative inference between User Devices and Edge Servers (ESs) based on DNN partitioning has succeeded in accelerating such services. However, most existing collaborative acceleration schemes partition a single DNN inference task; they cannot quickly make partition decisions for a set of concurrent inference tasks, and they often sacrifice inference accuracy. In addition, because ES resources are limited, concurrent requests compete for them, which prevents partitioned tasks from being offloaded to ESs in time for processing. Designing an efficient offloading scheme is therefore essential. Task offloading schemes based on deep reinforcement learning can solve complex decision-making problems in high-dimensional state spaces, but they suffer from insufficient sample diversity and tend to fall into local optima. This paper proposes a Collaborative Inference Acceleration Scheme integrating DNN Partitioning and Task Offloading (CIAS-PnO). First, the Collaborative DNN Layer Partitioning (CDLP) algorithm is designed with the goal of minimizing latency while preserving inference accuracy. CDLP uses a pruning operation to reduce the problem scale of partitioning concurrent inference tasks and determines partition decisions in time. Then, the Distributed Soft Actor-Critic (SAC)-based Partition Task Offloading algorithm (DSACO) is designed. DSACO lets SAC agents explore samples in parallel and share learning experience, and it uses an automatic entropy adjustment mechanism to improve the agents' exploration efficiency, thereby avoiding local optima and achieving efficient offloading of partitioned tasks.
Experimental results on DNN benchmarks show that, compared with the baseline acceleration schemes, CIAS-PnO improves acceleration performance by more than 19.8% and achieves better convergence performance and a higher task success rate.
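
To make the idea of latency-driven layer partitioning concrete, the following is a minimal sketch of choosing a split point for a single chain-structured DNN. It is not the CDLP algorithm from the paper, which handles concurrent tasks and pruning; it only illustrates the basic trade-off between device compute, transmission, and server compute. The function name `best_partition`, its parameters, and all numbers are hypothetical, and the model assumes per-layer latencies are known in advance and that returning the result to the device costs nothing.

```python
from typing import Sequence, Tuple


def best_partition(
    device_ms: Sequence[float],      # assumed per-layer latency on the user device
    server_ms: Sequence[float],      # assumed per-layer latency on the edge server
    output_bytes: Sequence[float],   # assumed size of each layer's output
    input_bytes: float,              # assumed size of the raw model input
    bandwidth_bytes_per_ms: float,   # assumed uplink bandwidth
) -> Tuple[int, float]:
    """Return (k, latency): layers [0, k) run on the device, [k, n) on the server."""
    n = len(device_ms)
    best_k, best_latency = 0, float("inf")
    for k in range(n + 1):
        if k == n:
            transfer_bytes = 0.0               # everything runs locally
        elif k == 0:
            transfer_bytes = input_bytes       # the raw input is uploaded
        else:
            transfer_bytes = output_bytes[k - 1]  # intermediate tensor is uploaded
        latency = (
            sum(device_ms[:k])
            + transfer_bytes / bandwidth_bytes_per_ms
            + sum(server_ms[k:])
        )
        if latency < best_latency:
            best_k, best_latency = k, latency
    return best_k, best_latency


if __name__ == "__main__":
    k, latency = best_partition(
        device_ms=[4.0, 6.0, 8.0],
        server_ms=[1.0, 1.5, 2.0],
        output_bytes=[2000, 500, 100],
        input_bytes=6000,
        bandwidth_bytes_per_ms=100,
    )
    print(k, latency)
```

With these illustrative numbers, splitting after the second layer wins because the intermediate tensor there is small enough that uploading it costs less than running the last layer locally. An exhaustive scan like this is cheap for one model; with many concurrent tasks the joint search space grows quickly, which is the scaling problem the abstract says CDLP addresses through pruning.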

PAME: Precision-Aware Multi-Exit DNN Serving for Reducing Latencies of Batched Inferences

Elastic DNN Inference with Unpredictable Exit in Edge Computing

Unlocking the Non-deterministic Computing Power with Memory-Elastic Multi-Exit Neural Networks

Extendable Multi-Device Collaborative Pipeline Parallel Inference in the Edge-Cloud Scenario

Multi-exit DNN inference acceleration for intelligent terminal with heterogeneous processors

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving

Collaborative Inference Acceleration Integrating DNN Partitioning and Task Offloading in Mobile Edge Computing

An Adaptive DNN Inference Acceleration Framework with End–edge–cloud Collaborative Computing

Throughput Maximization of DNN Inference: Batching or Multi-Tenancy?

P/D-Serve: Serving Disaggregated Large Language Model at Scale

Towards A Flexible Accuracy-Oriented Deep Learning Module Inference Latency Prediction Framework for Adaptive Optimization Algorithms

CAP: Communication-aware Automated Parallelization for Deep Learning Inference on CMP Architectures

A Collaborative PIM Computing Optimization Framework for Multi-Tenant DNN

Attention, Distillation, and Tabularization: Towards Practical Neural Network-Based Prefetching

Multi-Model Running Latency Optimization in an Edge Computing Paradigm

BCEdge: SLO-Aware DNN Inference Services With Adaptive Batch-Concurrent Scheduling on Edge Devices

Pre-DNNOff: On-Demand DNN Model Offloading Method for Mobile Edge Computing

A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading

A Fine-Grained End-to-End Latency Optimization Framework for Wireless Collaborative Inference

CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs