A generic deep learning architecture optimization method for edge device based on start-up latency reduction

Qi Li, Hengyi Li, Lin Meng

DOI: https://doi.org/10.1007/s11554-024-01496-8

IF: 2.293

2024-06-20

Journal of Real-Time Image Processing

Abstract: In the promising Artificial Intelligence of Things technology, deep learning algorithms are implemented on edge devices to process data locally. However, high-performance deep learning algorithms come with increased computation and parameter storage costs, which makes it difficult to implement large deep learning algorithms on memory- and power-constrained edge devices, such as smartphones and drones. Thus, various compression methods, such as channel pruning, have been proposed. According to an analysis of low-level operations on edge devices, existing channel pruning methods have a limited effect on latency optimization. Due to data processing operations, the pruned residual blocks still incur significant latency, which hinders real-time processing of CNNs on edge devices. Hence, we propose a generic deep learning architecture optimization method to achieve further acceleration on edge devices. The network is optimized in two stages, Global Constraint and Start-up Latency Reduction, which together prune both channels and residual blocks. Optimized networks are evaluated on desktop CPU, FPGA, ARM CPU, and PULP platforms. The experimental results show that latency is reduced by up to 70.40%, which is 13.63% higher than applying channel pruning alone, and that real-time processing is achieved on the edge device.

computer science, artificial intelligence; engineering, electrical & electronic; imaging science & photographic technology

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper addresses the latency incurred when high-performance deep learning algorithms run on edge devices such as smartphones and drones. Specifically, it proposes a generic deep learning architecture optimization method that further speeds up execution on edge devices by reducing start-up latency.

#### Background and Challenges

In Internet of Things (IoT) technology, deep learning algorithms often need to process data locally on edge devices. However, high-performance deep learning algorithms come with significant computation and parameter storage demands, which makes it challenging to deploy large deep learning algorithms on edge devices with limited memory and power. Existing compression methods, such as channel pruning, reduce the computational load to some extent, but significant latency remains in practice, especially the start-up latency caused by data processing operations in convolutional layers.

#### Research Objectives

The paper proposes a two-stage deep learning architecture optimization method:

1. **Global Constraint (GC) Stage**: Achieves lossless channel pruning by adding constraints to the main paths in the network.
2. **Start-up Latency Reduction (SLR) Stage**: Identifies and prunes residual blocks that cannot work efficiently due to the constraints, in order to reduce start-up latency.

With this method, the optimized network achieves a significant latency reduction on various platforms. Experimental results show that latency is reduced by up to 70.40%, which is 13.63% higher than with channel pruning alone, and that real-time processing is achieved on edge devices.
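
The reasoning behind the two stages can be illustrated with a toy latency model. The sketch below is not the authors' implementation: the `Block` type, the `startup_ms` and `ms_per_channel` costs, the `keep` map, and the `min_channels` threshold are all hypothetical, chosen only to show why channel pruning alone leaves a fixed per-block cost in place while removing a whole residual block also removes that cost.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A residual block in a toy latency model (all costs are made up)."""
    name: str
    channels: int          # channels left in the block's inner layers
    startup_ms: float      # fixed per-block cost, independent of channel count
    ms_per_channel: float  # compute cost that scales with channel count


def block_latency(block):
    # Fixed start-up cost plus a compute term proportional to channels.
    return block.startup_ms + block.channels * block.ms_per_channel


def network_latency(blocks):
    return sum(block_latency(b) for b in blocks)


def prune_channels(blocks, keep):
    """Stand-in for the channel pruning stage: shrink channel counts only.

    `keep` maps a block name to the number of channels to retain
    (hypothetical input; the paper derives this from its constraints).
    """
    return [
        Block(b.name, keep.get(b.name, b.channels), b.startup_ms, b.ms_per_channel)
        for b in blocks
    ]


def prune_blocks(blocks, min_channels):
    """Stand-in for the block pruning stage: drop nearly empty blocks.

    A residual block can be removed because its skip connection still
    carries the input forward. The threshold rule is an assumption.
    """
    return [b for b in blocks if b.channels >= min_channels]


def reduction_percent(before, after):
    return 100.0 * (1.0 - after / before)


# Four identical blocks: 2.0 ms start-up + 64 channels * 0.0625 ms each.
base = [Block(f"block{i}", 64, 2.0, 0.0625) for i in range(1, 5)]

after_gc = prune_channels(
    base, {"block1": 32, "block2": 4, "block3": 16, "block4": 16}
)
after_slr = prune_blocks(after_gc, min_channels=8)

for label, net in [("baseline", base), ("channels only", after_gc),
                   ("channels + blocks", after_slr)]:
    total = network_latency(net)
    saved = reduction_percent(network_latency(base), total)
    print(f"{label}: {total:.2f} ms ({saved:.1f}% lower than baseline)")
```

With these made-up numbers the totals are 24.0 ms, 12.25 ms, and 10.0 ms. In this toy model, `block2` keeps only 4 channels after channel pruning yet still costs 2.25 ms, almost all of it start-up cost; dropping the block removes that cost entirely. The figures are illustrative only and are unrelated to the measurements reported in the paper.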

