Abstract: The rapidly growing parameter count of deep neural networks (DNNs) hinders artificial intelligence applications on resource-constrained devices, such as mobile and wearable devices. Neural network pruning, one of the mainstream model compression techniques, is under extensive study as a way to reduce model size and thus the amount of computation, so that state-of-the-art DNNs can be deployed on those devices with high runtime energy efficiency. In contrast to irregular pruning, which incurs high index storage and decoding overhead, structured pruning techniques have been proposed as promising solutions. However, prior studies on structured pruning tackle the problem mainly from the perspective of facilitating hardware implementation, without analyzing the characteristics of sparse neural networks in depth. This neglect leads to an inefficient trade-off between regularity and pruning ratio. Consequently, the potential of structurally pruning neural networks has not been fully exploited.

In this work, we examine the structural characteristics of irregularly pruned weight matrices, such as the diverse redundancy of different rows, the sensitivity of different rows to pruning, and the positional characteristics of retained weights. Guided by these insights, we first propose the novel block-max weight masking (BMWM) method, which effectively retains the salient weights while imposing high regularity on the weight matrix. As a further optimization, we propose density-adaptive regular-block (DARB) pruning, which takes advantage of the intrinsic characteristics of neural networks and thereby outperforms prior structured pruning work in both pruning ratio and decoding efficiency. Our experimental results show that DARB achieves pruning ratios of 13× to 25×, a 2.8× to 4.3× improvement over state-of-the-art counterparts on multiple neural network models and tasks. Moreover, DARB achieves 14.3× the decoding efficiency of block pruning while also reaching a higher pruning ratio.
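
The abstract names block-max weight masking but does not spell out its mechanics. The sketch below is one plausible reading, under a stated assumption: each row of a weight matrix is cut into fixed-size blocks, and only the largest-magnitude weight in each block is kept. The function names `block_max_mask` and `index_bits_per_weight`, the block size, and the sample row are all hypothetical and chosen for illustration. The density-adaptive choice of block size per row is not described in the abstract, so it is not sketched here.

```python
from math import ceil, log2


def block_max_mask(row, block_size):
    """Assumed BMWM reading: keep the largest-magnitude weight per block.

    Returns the pruned row and, for each block, the offset of the kept
    weight inside that block. Ties go to the earliest position.
    """
    if block_size < 1:
        raise ValueError("block_size must be positive")
    pruned = [0.0] * len(row)
    offsets = []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        keep = max(range(len(block)), key=lambda i: abs(block[i]))
        pruned[start + keep] = block[keep]
        offsets.append(keep)
    return pruned, offsets


def index_bits_per_weight(block_size):
    """Bits needed to store one in-block offset (illustrative helper)."""
    return ceil(log2(block_size)) if block_size > 1 else 0


# Hypothetical 8-weight row split into two blocks of four.
row = [0.1, -0.9, 0.3, 0.05, 0.7, -0.2, 0.0, 0.4]
pruned, offsets = block_max_mask(row, block_size=4)
print(pruned)                    # one nonzero weight per block
print(offsets)                   # in-block position of each kept weight
print(index_bits_per_weight(4))  # offset width in bits
```

Under this reading, a row with one retained weight per block of size B has a pruning ratio of B, and each retained weight needs only a short in-block offset rather than a full column index. That is consistent with the regularity and decoding-efficiency argument in the abstract, though the actual encoding used by the authors may differ.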

BitXpro: Regularity-Aware Hardware Runtime Pruning for Deep Neural Networks

Class-Aware Pruning for Efficient Neural Networks

All-in-One: A Highly Representative DNN Pruning Framework for Edge Devices with Dynamic Power Management

Single-shot Pruning and Quantization for Hardware-Friendly Neural Network Acceleration

BBS: Bi-directional Bit-level Sparsity for Deep Learning Acceleration

Structured Term Pruning for Computational Efficient Neural Networks Inference

DPACS: Hardware Accelerated Dynamic Neural Network Pruning Through Algorithm-Architecture Co-design

HBP: Hierarchically Balanced Pruning and Accelerator Co-Design for Efficient DNN Inference

A fine-grained mixed precision DNN accelerator using a two-stage big-little core RISC-V MCU

A 400MHz NPU with 7.8TOPS²/W High-Performance-Guaranteed Efficiency in 55nm for Multi-Mode Pruning and Diverse Quantization Using Pattern-Kernel Encoding and Reconfigurable MAC Units

Bit-pragmatic Deep Neural Network Computing

A Flexible Yet Efficient DNN Pruning Approach for Crossbar-Based Processing-in-Memory Architectures

PruneAug: Bridging DNN Pruning and Inference Latency on Diverse Sparse Platforms Using Automatic Layerwise Block Pruning

PowerPruning: Selecting Weights and Activations for Power-Efficient Neural Network Acceleration

Small-world-based Structural Pruning for Efficient FPGA Inference of Deep Neural Networks

DARB: A Density-Adaptive Regular-Block Pruning for Deep Neural Networks

SCRA: Systolic-Friendly DNN Compression and Reconfigurable Accelerator Co-Design

Noise-Tolerant Hardware-Aware Pruning for Deep Neural Networks

Separate, Dynamic and Differentiable (SMART) Pruner for Block/Output Channel Pruning on Computer Vision Tasks

Hardware-aware Approach to Deep Neural Network Optimization

Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network Training