Abstract: When deploying deep neural networks (DNNs) on edge devices, much research effort has been devoted to coping with limited hardware resources. However, little attention has been paid to the influence of dynamic power management. Because edge devices typically run on a limited, battery-supplied energy budget (rather than the almost unlimited energy supply of servers or workstations), their dynamic power management often changes the execution frequency, as in the widely used dynamic voltage and frequency scaling (DVFS) technique. This leads to highly unstable inference speed, especially for computation-intensive DNN models, which can harm user experience and waste hardware resources. We first identify this problem and then propose All-in-One, a highly representative pruning framework that works with DVFS-based dynamic power management. The framework uses only one set of model weights and soft masks (together with other auxiliary parameters of negligible storage) to represent multiple models of various pruning ratios. By reconfiguring the model to the pruning ratio that corresponds to a specific execution frequency (and voltage), we achieve stable inference speed, i.e., we keep the difference in speed across execution frequencies as small as possible. Our experiments demonstrate that our method not only achieves high accuracy for multiple models of different pruning ratios, but also reduces the variance of their inference latency across frequencies, with the minimal memory consumption of only one model and one soft mask.
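
The reconfiguration idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a toy latency model in which latency is proportional to `(1 - pruning_ratio) / frequency`, and it assumes that a soft mask can be turned into a binary mask by dropping the lowest-scoring entries. The function names `pick_pruning_ratio` and `binarize`, the candidate ratios, and the frequencies are all hypothetical.

```python
def pick_pruning_ratio(freq_mhz, ref_freq_mhz, ratios):
    """Pick the smallest candidate ratio that keeps latency at or below
    the dense model's latency at the reference frequency.

    Assumed toy model (not from the paper): latency ~ (1 - ratio) / freq.
    """
    # (1 - r) / f <= 1 / f_ref  =>  r >= 1 - f / f_ref
    needed = 1.0 - freq_mhz / ref_freq_mhz
    for r in sorted(ratios):
        if r >= needed - 1e-9:
            return r
    # No candidate is aggressive enough; fall back to the largest ratio.
    return max(ratios)


def binarize(soft_mask, ratio):
    """Derive a 0/1 mask from one shared soft mask (illustrative assumption):
    the `ratio` fraction of entries with the lowest scores is pruned."""
    n_prune = int(round(ratio * len(soft_mask)))
    order = sorted(range(len(soft_mask)), key=lambda i: soft_mask[i])
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(soft_mask))]


if __name__ == "__main__":
    candidate_ratios = [0.0, 0.25, 0.5, 0.75]   # hypothetical ratios
    soft_mask = [0.9, 0.1, 0.5, 0.3]            # hypothetical scores
    for freq in (2000, 1500, 1000):             # hypothetical DVFS levels (MHz)
        ratio = pick_pruning_ratio(freq, 2000, candidate_ratios)
        print(freq, ratio, binarize(soft_mask, ratio))
```

Under these assumptions, a single stored soft mask yields a different binary mask for each frequency level, which mirrors the abstract's claim that one set of weights and masks represents several pruning ratios. A real system would use measured latencies rather than the proportional model.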

Performance Characterization and Optimization of Pruning Patterns for Sparse DNN Inference

Class-Aware Pruning for Efficient Neural Networks

All-in-One: A Highly Representative DNN Pruning Framework for Edge Devices with Dynamic Power Management

Structured Probabilistic Pruning for Convolutional Neural Network Acceleration

Structured Pruning for Efficient Convolutional Neural Networks Via Incremental Regularization

A Feature-map Discriminant Perspective for Pruning Deep Neural Networks

PruneAug: Bridging DNN Pruning and Inference Latency on Diverse Sparse Platforms Using Automatic Layerwise Block Pruning

Filter Pruning Via Feature Map Clustering

An Image Enhancing Pattern-Based Sparsity for Real-Time Inference on Mobile Devices

Network Pruning Spaces

1×N Block Pattern for Network Sparsity

1xN Pattern for Pruning Convolutional Neural Networks

Structured Term Pruning for Computational Efficient Neural Networks Inference

Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration

Towards Optimal Filter Pruning with Balanced Performance and Pruning Speed

A Dynamic Pruning Method on Multiple Sparse Structures in Deep Neural Networks

Rethinking the Mechanism of the Pattern Pruning and the Circle Importance Hypothesis

Investigating the Effect of Network Pruning on Performance and Interpretability

Cloud–Edge Collaborative Inference with Network Pruning