APQ: Automated DNN Pruning and Quantization for ReRAM-Based Accelerators
Siling Yang, Shuibing He, Hexiao Duan, Weijian Chen, Xuechen Zhang, Tong Wu, Yanlong Yin
DOI: https://doi.org/10.1109/tpds.2023.3290010
IF: 5.3
2023-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: Emerging ReRAM-based accelerators support in-memory computation to accelerate deep neural network (DNN) inference. Weight matrix pruning is a widely used technique to reduce the size of DNN models, thereby reducing the resource and energy consumption of ReRAM-based accelerators. However, existing pruning works for ReRAM-based accelerators have three major issues. First, they use heuristics or rules from domain experts to prune the weights, leading to sub-optimal pruning policies. Second, they prune weights at coarse row- or column-level granularity, resulting in poor compression rates under model accuracy constraints. Third, they apply weight pruning in isolation, missing the opportunity to combine pruning with quantization for further compression. In this article, we propose an Automated DNN Pruning and Quantization framework, named APQ, for ReRAM-based accelerators. First, APQ adopts reinforcement learning (RL) to automatically determine the pruning policy for DNN layers for a global optimum. Second, it prunes and maps weight matrices to a ReRAM-based accelerator at the finer granularity of column vectors, which improves the compression rates under the accuracy constraints. To address the dislocation problem, it uses a new data path in ReRAM-based accelerators to correctly index and feed input to matrix-vector computation. Third, to further reduce resource consumption, APQ also leverages reinforcement learning to automatically determine the quantization bitwidth of each layer of the pruned DNN model. Experimental results show that APQ achieves up to 4.52X compression rate, 4.11X area efficiency, and 4.51X energy efficiency with similar or even higher model accuracy, compared to the state-of-the-art work.
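The two compression steps named in the abstract, column-vector pruning followed by per-layer quantization, can be illustrated with a minimal sketch. This is not the APQ implementation: the abstract does not specify the pruning criterion, the vector length, or the quantizer, so the sketch assumes L2-norm ranking of column vectors, a hypothetical `vec_len` parameter, and symmetric uniform quantization. The pruning ratio and bitwidth, which APQ selects with reinforcement learning, are simply passed in as fixed arguments here.

```python
import math


def prune_column_vectors(weights, vec_len, prune_ratio):
    """Zero the column vectors with the smallest L2 norms.

    A "column vector" here is vec_len consecutive rows of a single column
    (an assumed definition; the abstract does not give the exact shape).
    """
    rows, cols = len(weights), len(weights[0])
    vectors = []  # (norm, first_row, column) for every column vector
    for r0 in range(0, rows, vec_len):
        r1 = min(r0 + vec_len, rows)
        for c in range(cols):
            norm = math.sqrt(sum(weights[r][c] ** 2 for r in range(r0, r1)))
            vectors.append((norm, r0, c))
    vectors.sort(key=lambda v: v[0])
    n_prune = int(len(vectors) * prune_ratio)

    pruned = [row[:] for row in weights]  # leave the input untouched
    for _, r0, c in vectors[:n_prune]:
        for r in range(r0, min(r0 + vec_len, rows)):
            pruned[r][c] = 0.0
    return pruned


def quantize_uniform(weights, bits):
    """Symmetric uniform quantization to signed `bits`-bit levels.

    Returns dequantized floats so the result can be compared with the input.
    """
    if bits < 2:
        raise ValueError("bits must be at least 2 for a signed quantizer")
    max_abs = max(abs(w) for row in weights for w in row)
    if max_abs == 0.0:
        return [row[:] for row in weights]
    scale = max_abs / (2 ** (bits - 1) - 1)
    return [[round(w / scale) * scale for w in row] for row in weights]


if __name__ == "__main__":
    # Illustrative 4x4 layer; the values are made up for this example.
    layer = [
        [0.9, 0.01, -0.5, 0.02],
        [-0.8, 0.02, 0.4, -0.01],
        [0.03, 0.7, 0.01, -0.6],
        [-0.02, -0.6, 0.02, 0.5],
    ]
    compressed = quantize_uniform(prune_column_vectors(layer, 2, 0.5), 4)
    for row in compressed:
        print(row)
```

Pruning whole column vectors rather than whole rows or columns leaves zeroed blocks at different row offsets in different columns, which is presumably the source of the dislocation problem that the abstract says requires a dedicated data path for indexing inputs.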