Abstract: Convolutional neural networks have shown tremendous performance in computer vision tasks, but their heavy weight storage and arithmetic requirements prevent them from being adopted in embedded environments. One solution is pruning, in which unimportant weights are forced to zero. Many pruning schemes have been proposed, but they have focused mainly on the number of pruned weights and have scarcely considered ASIC or FPGA accelerator architectures. When the pruned networks are run on accelerators, this lack of architectural consideration causes inefficiencies, including internal buffer misalignments and load imbalances. This paper proposes a new pruning scheme that reflects accelerator architectures. In the proposed scheme, pruning is performed so that the same number of weights remains in each weight group corresponding to activations fetched simultaneously. In this way, the pruning scheme resolves the inefficiency problems, doubling the accelerator performance. Even with this constraint, the proposed pruning scheme reached a pruning ratio similar to that of previous unconstrained pruning schemes, not only on AlexNet and VGG16 but also on state-of-the-art very deep networks such as ResNet. Furthermore, the proposed scheme demonstrated a comparable pruning ratio on compact networks such as MobileNet and on slimmed networks that were already pruned in a channel-wise manner. The paper also shows that, in addition to improving the efficiency of previous sparse accelerators, the proposed pruning scheme can be used to reduce the logic complexity of sparse accelerators. The pruned models are publicly available at <a class="link-external link-https" href="https://github.com/HyeongjuKang/accelerator-aware-pruning" rel="external noopener nofollow">https://github.com/HyeongjuKang/accelerator-aware-pruning</a>.
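The group constraint described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a weight group is a contiguous chunk of a flat weight list and that weights are ranked by magnitude, neither of which the abstract specifies. The function name `prune_groups` and the parameters `group_size` and `keep` are hypothetical.

```python
def prune_groups(weights, group_size, keep):
    """Zero all but the `keep` largest-magnitude weights in each contiguous group.

    Assumptions (not from the abstract): groups are contiguous chunks of a flat
    list, and importance is measured by absolute value.
    """
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be a multiple of group_size")
    if not 0 <= keep <= group_size:
        raise ValueError("keep must lie between 0 and group_size")
    pruned = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Rank positions by descending magnitude; sorted() is stable,
        # so ties are resolved in favor of the earlier position.
        order = sorted(range(group_size), key=lambda i: -abs(group[i]))
        kept = set(order[:keep])
        pruned.extend(w if i in kept else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, 0.4, 0.02, -0.7, 0.3, 0.01]
pruned = prune_groups(weights, group_size=4, keep=2)
# Count the surviving weights in each group of four.
nonzeros_per_group = [
    sum(1 for w in pruned[i:i + 4] if w != 0.0) for i in range(0, len(pruned), 4)
]
print(pruned)              # [0.9, 0.0, 0.0, 0.4, 0.0, -0.7, 0.3, 0.0]
print(nonzeros_per_group)  # [2, 2]
```

Because every group keeps the same number of weights, each group would need the same number of multiply operations, which is the property the abstract credits with removing load imbalances and buffer misalignments.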

Realizing Unaligned Block-wise Pruning for DNN Acceleration on Mobile Devices

All-in-One: A Highly Representative DNN Pruning Framework for Edge Devices with Dynamic Power Management

Class-Aware Pruning for Efficient Neural Networks

Structured Probabilistic Pruning for Convolutional Neural Network Acceleration

SUBP: Soft Uniform Block Pruning for 1 X N Sparse CNNs Multithreading Acceleration

Efficient Inference for Pruned CNN Models on Mobile Devices With Holistic Sparsity Alignment

Towards Real-Time DNN Inference on Mobile Platforms with Model Pruning and Compiler Optimization

Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration

Improving Device-Edge Cooperative Inference of Deep Learning via 2-Step Pruning

DBP: Discrimination Based Block-Level Pruning for Deep Model Acceleration

PCONV: the Missing but Desirable Sparsity in DNN Weight Pruning for Real-Time Execution on Mobile Devices

DARB: A Density-Aware Regular-Block Pruning for Deep Neural Networks

PruneAug: Bridging DNN Pruning and Inference Latency on Diverse Sparse Platforms Using Automatic Layerwise Block Pruning

Efficient DNN Neuron Pruning by Minimizing Layer-wise Nonlinear Reconstruction Error

Mobile or FPGA? A Comprehensive Evaluation on Energy Efficiency and a Unified Optimization Framework

ABCP: Automatic Block-wise and Channel-wise Network Pruning via Joint Search

Cloud–Edge Collaborative Inference with Network Pruning

Accelerator-Aware Pruning for Convolutional Neural Networks

Implication of Optimizing NPU Dataflows on Neural Architecture Search for Mobile Devices

Separate, Dynamic and Differentiable (SMART) Pruner for Block/Output Channel Pruning on Computer Vision Tasks

An efficient GPU-accelerated inference engine for binary neural network on mobile phones