A Flexible Yet Efficient DNN Pruning Approach for Crossbar-Based Processing-in-Memory Architectures.
Long Zheng, Haifeng Liu, Yu Huang, Dan Chen, Chaoqiang Liu, Haiheng He, Xiaofei Liao, Hai Jin, Jingling Xue
DOI: https://doi.org/10.1109/tcad.2022.3197510
IF: 2.9
2022-01-01
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Abstract: Pruning deep neural networks (DNNs) can reduce the model size and thus save hardware resources of a resistive-random-access-memory (ReRAM)-based DNN accelerator. Because of the tightly coupled crossbar structure, existing ReRAM-based pruning techniques prune the weights of a DNN in a structured manner and therefore attain only low pruning ratios. This article presents a novel pruning technique, SegPrune, that prunes the weights of a DNN flexibly on crossbar architectures in order to maximize the pruning ratio while preserving crossbar efficiency. We observe that different filters of a weight matrix share a large number of matrix subcolumns (in the same rows), called segments, that can be pruned by using the same segment shape, in the sense that the weights at the same column position of these segments are either simultaneously accuracy-sensitive (and should thus be preserved) or simultaneously accuracy-insensitive (and can thus be pruned). Owing to the bit-line exchangeability of the crossbar, segments with the same pruning shape can be assembled into the same crossbar to ensure crossbar execution efficiency. We propose a projection-based shape voting algorithm to select suitable segment shapes to drive the weight pruning process. Accordingly, we also introduce a low-overhead data path that can be easily integrated into any existing ReRAM-based DNN accelerator, achieving a high pruning ratio and a high execution efficiency. Our evaluation shows that SegPrune outperforms the state-of-the-art techniques, Hybrid-P and FORMAS, by up to $14.6\times$ and $3.6\times$ in pruning ratio, $13.9\times$ and $3.4\times$ in inference speedup, and $12.5\times$ and $3.1\times$ in energy reduction, respectively, while achieving even higher accuracy at the cost of less than 0.27% extra hardware area.
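The idea of segments that share a pruning shape can be illustrated with a small sketch. It is not the projection-based shape voting algorithm from the article, whose details the abstract does not give. It assumes instead, purely for illustration, that weight magnitude stands in for accuracy sensitivity, that each matrix column holds one filter (one bit-line), and that a fixed number of weights is kept per segment. The names `segment_shapes`, `group_by_shape`, `seg_len`, and `keep_per_seg` are hypothetical.

```python
from collections import defaultdict


def segment_shapes(weights, seg_len, keep_per_seg):
    """Cut every column into row-aligned segments and derive a keep/prune shape.

    weights: list of rows (each a list of floats); columns play the role of
    filters / bit-lines. ASSUMPTION: magnitude is used as a stand-in for
    accuracy sensitivity; the article uses a different (voting-based) method.
    Returns {(row_block, col): shape}, where shape is a tuple of 0/1 flags
    (1 = keep, 0 = prune), one flag per row position inside the segment.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    shapes = {}
    for start in range(0, n_rows, seg_len):
        stop = min(start + seg_len, n_rows)
        for col in range(n_cols):
            seg = [abs(weights[r][col]) for r in range(start, stop)]
            # Positions of the largest-magnitude weights; ties go to lower index.
            order = sorted(range(len(seg)), key=lambda i: (-seg[i], i))
            kept = set(order[:keep_per_seg])
            shape = tuple(1 if i in kept else 0 for i in range(len(seg)))
            shapes[(start // seg_len, col)] = shape
    return shapes


def group_by_shape(shapes):
    """Group columns of the same row block that share an identical shape.

    Because bit-lines can be reordered, each group could in principle be
    placed side by side in one crossbar with its pruned rows left out.
    """
    groups = defaultdict(list)
    for (block, col), shape in shapes.items():
        groups[(block, shape)].append(col)
    return dict(groups)


# Toy 4x4 weight matrix: 4 rows (inputs) by 4 columns (filters).
W = [
    [0.9, 0.1, 0.80, 0.2],
    [0.1, 0.7, 0.05, 0.6],
    [0.3, 0.4, 0.00, 0.9],
    [0.5, 0.1, 0.60, 0.2],
]
groups = group_by_shape(segment_shapes(W, seg_len=2, keep_per_seg=1))
for (block, shape), cols in sorted(groups.items()):
    print(f"row block {block}, shape {shape}: columns {cols}")
```

In this toy matrix, columns 0 and 2 share a shape in both row blocks, as do columns 1 and 3, so each pair could be stored together with half of its rows removed. A real flow would also have to bound the number of distinct shapes and check accuracy after pruning, which this sketch does not attempt.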