Abstract: Model compression methods are being developed to bridge the gap between the massive scale of neural networks and the limited hardware resources of edge devices. Because real-world applications deployed on resource-limited hardware platforms typically face several hardware constraints at once, existing model compression approaches that optimize only a single hardware objective are often ineffective. In this article, we propose an automated pruning method called multi-constrained model compression (MCMC) that optimizes multiple hardware targets, such as latency, floating point operations (FLOPs), and memory usage, while minimizing the impact on accuracy. Specifically, we propose an improved multi-objective reinforcement learning (MORL) algorithm, the one-stage envelope deep deterministic policy gradient (DDPG) algorithm, to determine the pruning strategy for neural networks. The one-stage envelope DDPG algorithm reduces exploration time and offers greater flexibility in adjusting target priorities, which makes it better suited to pruning tasks. For instance, on the visual geometry group (VGG)-16 network, our method achieved an 80% reduction in FLOPs, a 2.31× reduction in memory usage, and a 1.92× speedup, with an accuracy improvement of 0.09% over the baseline. On larger datasets, such as ImageNet, we reduced FLOPs by 50% for MobileNet-V1, yielding a 4.7× speedup and 1.48× memory compression while maintaining the same accuracy. When applied to edge devices, such as the Jetson Xavier NX, our method achieved a 71% reduction in FLOPs for MobileNet-V1, leading to a 1.63× speedup, 1.64× memory compression, and an improvement in accuracy.
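The idea of trading off several hardware targets under adjustable priorities can be illustrated with a small sketch. The code below is not the reward used by MCMC, which the abstract does not specify; it only assumes a common MORL pattern in which a vector of per-objective rewards is combined with a preference weight vector. The `Measurement` class, the `reward_vector` and `scalarize` functions, and all numbers are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """Hypothetical container for one model's measured metrics."""
    accuracy: float    # top-1 accuracy in [0, 1]
    latency_ms: float  # inference latency on the target device
    flops: float       # floating point operations per inference
    memory_mb: float   # memory footprint


def reward_vector(pruned: Measurement, baseline: Measurement) -> list[float]:
    """One entry per objective; a larger value is better for every entry.

    Assumed definition (not from the paper): accuracy change, followed by
    the fractional reduction in latency, FLOPs, and memory.
    """
    return [
        pruned.accuracy - baseline.accuracy,
        1.0 - pruned.latency_ms / baseline.latency_ms,
        1.0 - pruned.flops / baseline.flops,
        1.0 - pruned.memory_mb / baseline.memory_mb,
    ]


def scalarize(rewards: list[float], preference: list[float]) -> float:
    """Weighted average of the objectives under a preference vector."""
    if len(rewards) != len(preference):
        raise ValueError("rewards and preference must have the same length")
    total = sum(preference)
    if total <= 0:
        raise ValueError("preference weights must sum to a positive value")
    return sum(w * r for w, r in zip(preference, rewards)) / total


# Illustrative numbers only; they are not results from the paper.
baseline = Measurement(accuracy=0.70, latency_ms=10.0, flops=1.0e9, memory_mb=100.0)
pruned = Measurement(accuracy=0.70, latency_ms=5.0, flops=0.5e9, memory_mb=80.0)

rewards = reward_vector(pruned, baseline)
balanced = scalarize(rewards, [1.0, 1.0, 1.0, 1.0])      # equal priorities
latency_first = scalarize(rewards, [0.0, 1.0, 0.0, 0.0])  # latency only
print(rewards, balanced, latency_first)
```

Changing the preference vector changes which pruning strategy scores highest without redefining the individual objectives, which is the kind of flexibility in target priorities that the abstract refers to.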

Adaptive ensemble optimization for memory-related hyperparameters in retraining DNN at edge

Unlocking the Non-deterministic Computing Power with Memory-Elastic Multi-Exit Neural Networks

Condense: A Framework for Device and Frequency Adaptive Neural Network Models on the Edge

MCMC: Multi-Constrained Model Compression Via One-Stage Envelope Reinforcement Learning

MOC: Multi-Objective Mobile CPU-GPU Co-Optimization for Power-Efficient DNN Inference

Scaling Up Deep Neural Network Optimization for Edge Inference

AccEPT: an Acceleration Scheme for Speeding Up Edge Pipeline-parallel Training

FlexNN: Efficient and Adaptive DNN Inference on Memory-Constrained Edge Devices

Achieving Pareto Optimality using Efficient Parameter Reduction for DNNs in Resource-Constrained Edge Environment

Optimizing for In-memory Deep Learning with Emerging Memory Technology

An Application-oblivious Memory Scheduling System for DNN Accelerators

pommDNN: Performance optimal GPU memory management for deep neural network training

Optimizing Off-Chip Memory Access for Deep Neural Network Accelerator

Adaptive Precision Training for Resource Constrained Devices

STR: Hybrid Tensor Re-Generation to Break Memory Wall for DNN Training

SwapNet: Efficient Swapping for DNN Inference on Edge AI Devices Beyond the Memory Budget

Pre-DNNOff: On-Demand DNN Model Offloading Method for Mobile Edge Computing

Low-Rank Training of Deep Neural Networks for Emerging Memory Technology

3U-EdgeAI: Ultra-Low Memory Training, Ultra-Low Bitwidth Quantization, and Ultra-Low Latency Acceleration

Improving QoE of Deep Neural Network Inference on Edge Devices: A Bandit Approach

AntiDote: Attention-based Dynamic Optimization for Neural Network Runtime Efficiency