Abstract: Mobile devices are increasingly capable of delivering intelligent services by leveraging deep learning architectures such as deep neural networks (DNNs). However, because these tasks are compute-intensive, mobile devices often struggle to handle them independently, which has motivated collaborative inference as a promising route to low-latency mobile intelligence. Despite its potential, many challenges must be addressed before the benefits of inference acceleration can be fully realized. This paper presents a collaborative device-edge inference optimization framework for inference acceleration. The framework comprises fundamental modules, including the Parameters Generator, Accuracy Predictor, Delay Calculator, and Optimizer, which are designed to identify the optimal set of parameters for Model Compression, DNN Partition, and Feature Compression. To illustrate its implementation, an example deep CNN is introduced, and the collaborative inference latency optimization is formulated as a mixed-integer programming problem. A specific algorithm instance that uses a quantum-inspired optimizer within the optimization framework is then presented. A multiple regression-based inference accuracy prediction model is proposed to keep inference accuracy close to that of the original network while significantly reducing time consumption during the offline phase. In simulation scenarios involving AlexNet and RegNet inference tasks on CIFAR-10, with diverse hardware computation specifications and wireless communication link conditions, the proposed framework achieves greater inference acceleration than the baseline methods.
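The interplay between DNN Partition, Feature Compression, and an accuracy constraint can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the names `Layer`, `total_delay`, and `best_config`, the three-term delay model (device compute, transmission, edge compute), and all numeric values are hypothetical. Model Compression is omitted, the accuracy predictor is a stand-in function rather than the regression model, and a plain exhaustive search replaces the quantum-inspired optimizer.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Layer:
    flops: float      # computation required by this layer, in FLOPs
    out_bytes: float  # size of this layer's output feature map, in bytes


def total_delay(layers, input_bytes, cut, ratio,
                device_flops, edge_flops, bandwidth):
    """Delay when layers[:cut] run on the device and layers[cut:] on the edge.

    ratio >= 1 is a hypothetical feature compression ratio; it is applied
    only to intermediate features, not to the raw input (cut == 0).
    """
    device_time = sum(l.flops for l in layers[:cut]) / device_flops
    edge_time = sum(l.flops for l in layers[cut:]) / edge_flops
    if cut == len(layers):
        tx_time = 0.0  # fully on-device: nothing is transmitted
    elif cut == 0:
        tx_time = input_bytes / bandwidth  # raw input sent uncompressed
    else:
        tx_time = (layers[cut - 1].out_bytes / ratio) / bandwidth
    return device_time + tx_time + edge_time


def best_config(layers, input_bytes, ratios, device_flops, edge_flops,
                bandwidth, predict_accuracy, min_accuracy):
    """Exhaustively search (cut, ratio) pairs that meet the accuracy floor.

    Returns (delay, cut, ratio) for the fastest feasible pair, or None.
    """
    best = None
    for cut, ratio in product(range(len(layers) + 1), ratios):
        if predict_accuracy(cut, ratio) < min_accuracy:
            continue  # predicted accuracy too low: infeasible
        delay = total_delay(layers, input_bytes, cut, ratio,
                            device_flops, edge_flops, bandwidth)
        if best is None or delay < best[0]:
            best = (delay, cut, ratio)
    return best


# Hypothetical three-layer network and link parameters.
LAYERS = [Layer(1e9, 4e5), Layer(2e9, 1e5), Layer(1e9, 4e3)]
PARAMS = dict(input_bytes=6e5, ratios=[1, 4], device_flops=5e9,
              edge_flops=2e10, bandwidth=1e6)


def toy_accuracy(cut, ratio):
    # Stand-in predictor: accuracy drops slightly with stronger compression.
    return 0.90 - 0.01 * (ratio - 1)


if __name__ == "__main__":
    print(best_config(LAYERS, predict_accuracy=toy_accuracy,
                      min_accuracy=0.85, **PARAMS))
```

With these toy numbers, splitting after the first layer and compressing the transmitted features by 4x gives the lowest delay. A search of this kind only scales to small parameter spaces, which is one reason a dedicated optimizer would be used in practice.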

Aaron: Compile-time Kernel Adaptation for Multi-DNN Inference Acceleration on Edge GPU

Condense: A Framework for Device and Frequency Adaptive Neural Network Models on the Edge

Extendable Multi-Device Collaborative Pipeline Parallel Inference in the Edge-Cloud Scenario

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

Edge Intelligence: On-Demand Deep Learning Model Co-Inference with Device-Edge Synergy

Edge AI: On-Demand Accelerating Deep Neural Network Inference via Edge Computing

CoEdge: Cooperative DNN Inference With Adaptive Workload Partitioning Over Heterogeneous Edge Devices

Multi-Compression Scale DNN Inference Acceleration based on Cloud-Edge-End Collaboration

Research on Convolutional Neural Network Inference Acceleration and Performance Optimization for Edge Intelligence

Minimizing Latency for Multi-DNN Inference on Resource-Limited CPU-Only Edge Devices

Bring Your Own Codegen to Deep Learning Compiler

Empowering In-Browser Deep Learning Inference on Edge Devices with Just-in-Time Kernel Optimizations

Joint Architecture Design and Workload Partitioning for DNN Inference on Industrial IoT Clusters

oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation

A Fine-Grained End-to-End Latency Optimization Framework for Wireless Collaborative Inference

DistrEdge: Speeding up Convolutional Neural Network Inference on Distributed Edge Devices

Adaptive Device-Edge Collaboration on DNN Inference in AIoT: A Digital Twin-Assisted Approach

Constructing an AI Compiler for ARM Cortex-M Devices

An Adaptive DNN Inference Acceleration Framework with End–edge–cloud Collaborative Computing

Context-Aware Compilation of DNN Training Pipelines across Edge and Cloud

AI on the Edge: Rethinking AI-based IoT Applications Using Specialized Edge Architectures