Abstract: International Journal of Software Engineering and Knowledge Engineering, Ahead of Print. In mobile edge computing environments, intelligent inference services driven by Deep Neural Networks (DNNs) are highly sensitive to latency. Recently, collaborative inference between user devices and Edge Servers (ESs) based on DNN partitioning has succeeded in accelerating such services. However, most existing collaborative acceleration schemes partition a single DNN inference task; they cannot quickly make partition decisions for a set of concurrent inference tasks, and they often sacrifice inference accuracy. In addition, because ES resources are limited, concurrent requests compete for them, so partitioned tasks cannot always be offloaded to ESs in time for processing. Designing an efficient offloading scheme is therefore essential. Task offloading schemes based on deep reinforcement learning can solve complex decision-making problems in high-dimensional state spaces, but they suffer from insufficient sample diversity and tend to fall into local optima. This paper proposes a Collaborative Inference Acceleration Scheme integrating DNN Partitioning and Task Offloading (CIAS-PnO). First, the Collaborative DNN Layer Partitioning (CDLP) algorithm is designed to minimize latency while preserving inference accuracy. CDLP uses a pruning operation to reduce the problem scale of partitioning concurrent inference tasks and determines the partition decisions in time. Then, the Distributed Soft Actor-Critic (SAC)-based Partition Task Offloading algorithm (DSACO) is designed. DSACO lets SAC agents explore samples in parallel and share learning experiences, and uses an automatic entropy adjustment mechanism to improve the agents' exploration efficiency, thereby avoiding local optima and achieving efficient offloading of partitioned tasks.
Experimental results on DNN benchmarks show that CIAS-PnO improves acceleration performance by more than 19.8% over the baseline acceleration schemes, and achieves better convergence and a higher task success rate.
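
To make the idea of layer-wise DNN partitioning concrete, the following is a minimal sketch of choosing a partition point for a single inference task. It is not the CDLP algorithm from the paper: it performs a plain exhaustive search, with no pruning and no handling of concurrent tasks or accuracy constraints. The `Layer` fields, the `best_partition` function, and all latency and size figures are hypothetical and chosen only for illustration. The sketch also assumes that the cost of returning the result to the device is negligible.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Layer:
    device_ms: float  # assumed latency of this layer on the user device
    edge_ms: float    # assumed latency of this layer on the edge server
    output_kb: float  # assumed size of this layer's output tensor


def best_partition(
    layers: List[Layer], input_kb: float, bandwidth_kb_per_ms: float
) -> Tuple[int, float]:
    """Return (k, latency): layers[:k] run on the device, layers[k:] on the edge.

    k == 0 offloads the whole network; k == len(layers) runs it all locally.
    """
    n = len(layers)

    # edge_suffix[k] = total edge latency of layers[k:]
    edge_suffix = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        edge_suffix[i] = edge_suffix[i + 1] + layers[i].edge_ms

    best_k, best_latency = 0, float("inf")
    device_prefix = 0.0  # total device latency of layers[:k]
    for k in range(n + 1):
        if k > 0:
            device_prefix += layers[k - 1].device_ms
        if k == n:
            transfer_ms = 0.0  # nothing is sent to the edge server
        else:
            # Send the raw input (k == 0) or the output of the last local layer.
            sent_kb = input_kb if k == 0 else layers[k - 1].output_kb
            transfer_ms = sent_kb / bandwidth_kb_per_ms
        total = device_prefix + transfer_ms + edge_suffix[k]
        if total < best_latency:
            best_k, best_latency = k, total
    return best_k, best_latency


# Hypothetical three-layer network on a 10 KB/ms uplink.
demo_layers = [Layer(2.0, 1.0, 100.0), Layer(4.0, 1.0, 10.0), Layer(20.0, 2.0, 5.0)]
print(best_partition(demo_layers, input_kb=200.0, bandwidth_kb_per_ms=10.0))  # (2, 9.0)
```

In this toy setting the search keeps the first two layers on the device, because their outputs shrink the data to be transmitted, and offloads the expensive last layer. A scheme for concurrent tasks would additionally need to account for shared ES resources, which this sketch ignores.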

AIDA: Associative DNN Inference Accelerator

A Near Memory Computing FPGA Architecture for Neural Network Acceleration

DaDianNao: A Machine-Learning Supercomputer

ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars

An Energy-Efficient Near-Data Processing Accelerator for DNNs that Optimizes Data Accesses

A Low-Latency DNN Accelerator Enabled by DFT-Based Convolution Execution Within Crossbar Arrays

MAICC: A Lightweight Many-core Architecture with In-Cache Computing for Multi-DNN Parallel Inference

IDLA: an Instruction-based Adaptive CNN Accelerator

End-to-End DNN Inference on a Massively Parallel Analog In Memory Computing Architecture

Eidetic: An In-Memory Matrix Multiplication Accelerator for Neural Networks

Pie: A Pipeline Energy-Efficient Accelerator for Inference Process in Deep Neural Networks

ARAS: An Adaptive Low-Cost ReRAM-Based Accelerator for DNNs

Field-Programmable Deep Neural Network (DNN) Learning and Inference accelerator: a concept

Collaborative Inference Acceleration Integrating DNN Partitioning and Task Offloading in Mobile Edge Computing

Benchmark of the Compute-in-Memory-Based DNN Accelerator With Area Constraint

High-Performance Method and Architecture for Attention Computation in DNN Inference

EDEN: Enabling Energy-Efficient, High-Performance Deep Neural Network Inference Using Approximate DRAM

An Efficient Channel-Aware Sparse Binarized Neural Networks Inference Accelerator

An All-Digital Compute-In-Memory FPGA Architecture for Deep Learning Acceleration

ODIN: A Bit-Parallel Stochastic Arithmetic Based Accelerator for In-Situ Neural Network Processing in Phase Change RAM

ADA-GP: Accelerating DNN Training By Adaptive Gradient Prediction