Abstract: The rapidly growing parameter volume of deep neural networks (DNNs) hinders artificial intelligence applications on resource-constrained devices, such as mobile and wearable devices. Neural network pruning, one of the mainstream model compression techniques, is under extensive study as a way to reduce the number of parameters and computations. In contrast to irregular pruning, which incurs high index storage and decoding overhead, structured pruning techniques have been proposed as a promising solution. However, prior studies on structured pruning tackle the problem mainly from the perspective of facilitating hardware implementation, without analyzing the characteristics of sparse neural networks. Neglecting these characteristics leads to an inefficient trade-off between regularity and pruning ratio. Consequently, the potential of structurally pruning neural networks has not been fully exploited. In this work, we examine the structural characteristics of irregularly pruned weight matrices, such as the diverse redundancy of different rows, the sensitivity of different rows to pruning, and the positional characteristics of retained weights. Guided by these insights, we first propose a novel block-max weight masking (BMWM) method, which effectively retains the salient weights while imposing high regularity on the weight matrix. As a further optimization, we propose density-adaptive regular-block (DARB) pruning, which outperforms prior structured pruning work in both pruning ratio and decoding efficiency. Our experimental results show that DARB achieves pruning ratios of 13$\times$ to 25$\times$, a 2.8$\times$ to 4.3$\times$ improvement over state-of-the-art counterparts on multiple neural network models and tasks. Moreover, DARB achieves a 14.3$\times$ improvement in decoding efficiency over block pruning, together with a higher pruning ratio.
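The abstract names block-max weight masking without spelling out its mechanics. As a rough illustration only, the sketch below assumes the name can be read literally: each row of a weight matrix is split into fixed-size blocks and only the largest-magnitude weight in each block is retained. The function names `block_max_mask` and `pruning_ratio`, the block size, and the sample row are all hypothetical and are not taken from the paper.

```python
def block_max_mask(row, block_size):
    """Return a 0/1 mask that keeps the largest-magnitude weight in each block.

    Assumption: a literal reading of "block-max" masking, not the authors' code.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    mask = [0] * len(row)
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        # Offset of the largest-magnitude entry in this block (first one on ties).
        offset = max(range(len(block)), key=lambda i: abs(block[i]))
        mask[start + offset] = 1
    return mask


def pruning_ratio(mask):
    """Total number of weights divided by the number of retained weights."""
    return len(mask) / sum(mask)


# Hypothetical row of eight weights, split into two blocks of four.
row = [0.1, -0.9, 0.3, 0.05, 0.2, -0.4, 0.7, 0.0]
mask = block_max_mask(row, block_size=4)
pruned_row = [w * m for w, m in zip(row, mask)]
```

Under this assumption every block holds exactly one retained weight, so the position of each survivor can be stored as a small offset within its block, which is one plausible source of the regularity the abstract refers to.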

SMVAR: A Novel RNN Accelerator Based on Non-blocking Data Distribution Structure

SUBP: Soft Uniform Block Pruning for 1×N Sparse CNNs Multithreading Acceleration

Structured Probabilistic Pruning for Convolutional Neural Network Acceleration.

MASR: A Modular Accelerator for Sparse RNNs

Learning the sparsity for ReRAM - mapping and pruning sparse neural network for ReRAM based accelerator.

Block-Sparse Recurrent Neural Networks

Exploiting Symmetric Temporally Sparse BPTT for Efficient RNN Training

GRIM: A General, Real-Time Deep Learning Inference Framework for Mobile Devices based on Fine-Grained Structured Weight Sparsity

Structured Pruning of Recurrent Neural Networks through Neuron Selection

SNrram: an Efficient Sparse Neural Network Computation Architecture Based on Resistive Random-Access Memory.

APQ: Automated DNN Pruning and Quantization for ReRAM-Based Accelerators

AUTO-PRUNE

A Fine-Grained Sparse Accelerator for Multi-Precision DNN.

Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity

DARB: A Density-Aware Regular-Block Pruning for Deep Neural Networks

An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

Crane: Mitigating Accelerator Under-utilization Caused by Sparsity Irregularities in CNNs

SPAT: FPGA-based Sparsity-Optimized Spiking Neural Network Training Accelerator with Temporal Parallel Dataflow

Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation