Abstract: By intelligently eliminating compute operations based on the run-time input, dynamic pruning (DP) promises to substantially improve deep neural network inference speed without a major impact on accuracy. Although many DP algorithms with good pruning performance have been proposed, translating their theoretical reductions in compute operations into satisfactory end-to-end speedups in practical implementations remains a challenge. Three factors contribute to this difficulty: the overhead of identifying the operations to prune at run time, the need to process the resulting dynamic dataflow efficiently, and the non-trivial memory I/O bottleneck that emerges as the number of compute operations falls. This paper presents the design and implementation of DPACS to address these challenges. DPACS combines a hardware-aware dynamic spatial and channel pruning algorithm with a dynamic dataflow engine in hardware to process the pruned network efficiently. A channel mask precomputation scheme reduces memory I/O, and a dedicated inter-layer pipeline provides efficient indexing and dataflow of sparse activations. Extensive design space exploration has been performed using two architectural variations implemented on FPGA to accelerate multiple networks from the ResNet family on the ImageNet and CIFAR10 datasets across a wide range of pruning ratios. Across the spectrum of configurations, DPACS achieves 1.1× to 3.9× end-to-end speedup over a baseline hardware implementation without pruning. Analysis of the tradeoff among accuracy, compute, and memory I/O performance highlights the importance of algorithm-architecture codesign in developing DP systems.
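To make the idea of dynamic spatial and channel pruning concrete, the following is a minimal NumPy sketch, not the DPACS algorithm or its hardware dataflow. It assumes a 1×1 (pointwise) convolution, and the function names `dense_pointwise_conv` and `pruned_pointwise_conv` are hypothetical. The masks are fixed by hand here so the example is deterministic; in a real DP system they would be predicted from the input at run time.

```python
import numpy as np

def dense_pointwise_conv(x, weight):
    # x: (H, W, C_in), weight: (C_in, C_out) -> (H, W, C_out)
    return x @ weight

def pruned_pointwise_conv(x, weight, spatial_mask, channel_mask):
    """Compute only the output pixels and output channels kept by the masks.

    Illustrative assumption: pruned outputs are simply left at zero.
    Returns the output tensor and the number of multiply-accumulates used.
    """
    h, w, c_in = x.shape
    c_out = weight.shape[1]
    out = np.zeros((h, w, c_out), dtype=x.dtype)

    rows, cols = np.nonzero(spatial_mask)   # active pixel coordinates
    chans = np.nonzero(channel_mask)[0]     # active output channels

    gathered = x[rows, cols, :]             # (P, C_in): gather active pixels
    result = gathered @ weight[:, chans]    # (P, A): compute active channels only

    # Scatter the results back into the dense output layout.
    out[rows[:, None], cols[:, None], chans[None, :]] = result

    macs = gathered.shape[0] * c_in * chans.size
    return out, macs

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 8))
weight = rng.standard_normal((8, 8))

spatial_mask = np.zeros((4, 4), dtype=bool)
spatial_mask[:2, :] = True                                         # keep 8 of 16 pixels
channel_mask = np.array([1, 1, 1, 0, 1, 1, 0, 1], dtype=bool)      # keep 6 of 8 channels

out, macs = pruned_pointwise_conv(x, weight, spatial_mask, channel_mask)
dense_macs = x.shape[0] * x.shape[1] * weight.shape[0] * weight.shape[1]
print(f"dense MACs: {dense_macs}, pruned MACs: {macs}, "
      f"theoretical reduction: {dense_macs / macs:.2f}x")
```

The reduction printed here counts compute operations only. The gather and scatter steps in the sketch stand in for the indexing and memory traffic that the abstract identifies as the reason such theoretical reductions do not translate directly into end-to-end speedup.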

Rethinking Pruning for Accelerating Deep Inference at the Edge

Class-Aware Pruning for Efficient Neural Networks

Loss Constrains Added Squeeze and Excitation Blocks for Pruning Deep Neural Networks

A Feature-map Discriminant Perspective for Pruning Deep Neural Networks

All-in-One: A Highly Representative DNN Pruning Framework for Edge Devices with Dynamic Power Management

Cloud–Edge Collaborative Inference with Network Pruning

NEPENTHE: Entropy-Based Pruning as a Neural Network Depth's Reducer

Overview of Deep Convolutional Neural Network Pruning

PruneAug: Bridging DNN Pruning and Inference Latency on Diverse Sparse Platforms Using Automatic Layerwise Block Pruning

Structured Term Pruning for Computational Efficient Neural Networks Inference

Rethinking the Value of Network Pruning

DPACS: Hardware Accelerated Dynamic Neural Network Pruning Through Algorithm-Architecture Co-design

Iterative Activation-based Structured Pruning

Optimizing the Deep Neural Networks by Layer-Wise Refined Pruning and the Acceleration on FPGA

Archtree: on-the-fly tree-structured exploration for latency-aware pruning of deep neural networks

Multi-Dimensional Pruning: Joint Channel, Layer and Block Pruning with Latency Constraint

Knapsack Pruning with Inner Distillation

Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures

Joint Pruning and Channel-Wise Mixed-Precision Quantization for Efficient Deep Neural Networks

Pruning and quantization for deep neural network acceleration: A survey