Abstract: Convolutional neural networks have shown tremendous performance in computer vision tasks, but their heavy weight storage and arithmetic requirements prevent them from being adopted in embedded environments. One solution is pruning, where certain unimportant weights are forced to zero. Many pruning schemes have been proposed, but these have mainly focused on the number of pruned weights and have scarcely considered ASIC or FPGA accelerator architectures. When the resulting pruned networks are run on accelerators, this lack of architectural consideration causes inefficiencies, including internal buffer misalignments and load imbalances. This paper proposes a new pruning scheme that reflects accelerator architectures. In the proposed scheme, pruning is performed so that the same number of weights remains in each weight group corresponding to activations fetched simultaneously. In this way, the pruning scheme resolves the inefficiency problems, doubling the accelerator performance. Even with this constraint, the proposed pruning scheme reached a pruning ratio similar to that of previous unconstrained pruning schemes, not only on AlexNet and VGG16 but also on state-of-the-art very deep networks such as ResNet. Furthermore, the proposed scheme demonstrated a comparable pruning ratio on compact networks such as MobileNet and on slimmed networks that were already pruned in a channel-wise manner. In addition to improving the efficiency of previous sparse accelerators, it is also shown that the proposed pruning scheme can be used to reduce the logic complexity of sparse accelerators. The pruned models are publicly available at https://github.com/HyeongjuKang/accelerator-aware-pruning.
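The group-wise constraint described in the abstract can be illustrated with a minimal sketch. It assumes that weights are laid out as a flat list, that each consecutive run of `group_size` weights forms one group (standing in for the weights that correspond to activations fetched simultaneously), and that the surviving weights are chosen by largest magnitude. The abstract does not state the selection criterion, so magnitude is an assumption here, and the names `prune_groups`, `group_size`, and `keep` are hypothetical.

```python
def prune_groups(weights, group_size, keep):
    """Zero all but the `keep` largest-magnitude weights in every group.

    Assumption: a "group" is a consecutive run of `group_size` weights,
    standing in for the weights matched to activations fetched together.
    """
    if len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a multiple of group_size")
    if not 0 <= keep <= group_size:
        raise ValueError("keep must lie between 0 and group_size")

    pruned = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Rank positions in this group by magnitude, largest first.
        order = sorted(range(group_size), key=lambda i: abs(group[i]), reverse=True)
        kept = set(order[:keep])
        # Every group ends up with exactly `keep` surviving positions.
        pruned.extend(w if i in kept else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
pruned = prune_groups(weights, group_size=4, keep=2)
print(pruned)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.25, 0.0]
```

Because every group retains the same number of weights, a hypothetical accelerator processing one group per step would see a fixed amount of work per step, which is the property the abstract credits with avoiding buffer misalignment and load imbalance.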

Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration

Class-Aware Pruning for Efficient Neural Networks

Structured Probabilistic Pruning for Convolutional Neural Network Acceleration

A Feature-map Discriminant Perspective for Pruning Deep Neural Networks

All-in-One: A Highly Representative DNN Pruning Framework for Edge Devices with Dynamic Power Management

Structured Deep Neural Network Pruning by Varying Regularization Parameters

Structured Pruning for Efficient Convolutional Neural Networks Via Incremental Regularization

Loss Constrains Added Squeeze and Excitation Blocks for Pruning Deep Neural Networks

PCONV: the Missing but Desirable Sparsity in DNN Weight Pruning for Real-Time Execution on Mobile Devices

An Image Enhancing Pattern-Based Sparsity for Real-Time Inference on Mobile Devices

SS-Auto: A Single-Shot, Automatic Structured Weight Pruning Framework of DNNs with Ultra-High Efficiency

Towards Real-Time DNN Inference on Mobile Platforms with Model Pruning and Compiler Optimization

Efficient Inference for Pruned CNN Models on Mobile Devices With Holistic Sparsity Alignment

PruneAug: Bridging DNN Pruning and Inference Latency on Diverse Sparse Platforms Using Automatic Layerwise Block Pruning

A Compact Parallel Pruning Scheme for Deep Learning Model and Its Mobile Instrument Deployment

StructADMM: A Systematic, High-Efficiency Framework of Structured Weight Pruning for DNNs

AutoCompress: An Automatic DNN Structured Pruning Framework for Ultra-High Compression Rates

Accelerator-Aware Pruning for Convolutional Neural Networks

Progressive DNN Compression: A Key to Achieve Ultra-High Weight Pruning and Quantization Rates using ADMM

A systematic DNN weight pruning framework based on symmetric accelerated stochastic ADMM

Efficient Joint Optimization of Layer-Adaptive Weight Pruning in Deep Neural Networks