PACA: A Pattern Pruning Algorithm and Channel-Fused High PE Utilization Accelerator for CNNs.

Jingyu Wang, Songming Yu, Zhuqing Yuan, Jinshan Yue, Zhe Yuan, Ruoyang Liu, Yanzhi Wang, Huazhong Yang, Xueqing Li, Yongpan Liu
DOI: https://doi.org/10.1109/tcad.2022.3140730
IF: 2.9
2022-01-01
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Abstract: In recent years, convolutional neural networks (CNNs) have achieved significant advancements in various fields. However, the computation and storage overheads of CNNs are overwhelming for Internet-of-Things devices. Both network pruning algorithms and hardware accelerators have been introduced to enable CNN inference at the edge. Network pruning algorithms reduce the size and computational cost of CNNs by regularizing unimportant weights to zero. However, existing works lack intrakernel structured types to trade off between sparsity and hardware efficiency, and the index storage for irregularly pruned networks is significant. Hardware accelerators leverage the sparsity of pruned CNNs to improve energy efficiency. However, their processing element (PE) utilization rate is low because of uneven sparsity among input convolutional kernels. To overcome these problems, we propose PACA: a Pattern pruning Algorithm and Channel-fused high PE utilization Accelerator for CNNs. It includes three parts: a pattern pruning algorithm to explore the intrakernel sparsity type and reduce the index storage, a channel-fused hardware architecture to reduce the PEs’ idle rate and improve the performance, and a heuristic and taboo search-based smart fusion scheduler to analyze the idle PE problem and schedule the channel fusion in hardware. To demonstrate the effectiveness of PACA, we have implemented the software parts in Python and the hardware architecture in RTL code. Experimental results on various datasets show that, compared with an existing work, PACA reduces the index storage overhead by $3.47\times$–$5.63\times$ with an average of 3.85–9.12 patterns, and improves hardware performance by $2.01\times$–$5.53\times$ because of the reduced PE idle rate.
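To make the index-storage argument concrete, here is a minimal Python sketch of generic pattern pruning. It is not the PACA algorithm itself: the four 3x3 masks, the magnitude-based selection rule, and the names `assign_patterns` and `index_bits_per_kernel` are all assumptions chosen for illustration. The point it shows is that when every kernel is restricted to one mask from a small shared set, the sparsity index shrinks to one pattern ID per kernel rather than one position per surviving weight.

```python
import math
import numpy as np

def assign_patterns(kernels, patterns):
    """For each kernel, pick the mask that retains the most weight magnitude.

    kernels:  (n, k, k) float array of convolution kernels
    patterns: (p, k, k) array of 0/1 masks (hypothetical pattern set)
    """
    # retained[i, j] = sum of |kernels[i]| over the positions kept by patterns[j]
    retained = np.einsum("nij,pij->np", np.abs(kernels), patterns)
    ids = retained.argmax(axis=1)
    # Zero out every weight that the chosen mask does not keep
    pruned = kernels * patterns[ids]
    return ids, pruned

def index_bits_per_kernel(num_patterns):
    """Bits needed to store one pattern ID per kernel."""
    return max(1, math.ceil(math.log2(num_patterns)))

# Illustrative data: 8 random 3x3 kernels and 4 masks that each keep 4 weights
rng = np.random.default_rng(0)
kernels = rng.standard_normal((8, 3, 3))
patterns = np.array([
    [[1, 1, 0], [1, 1, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 0], [1, 1, 0]],
    [[0, 0, 0], [0, 1, 1], [0, 1, 1]],
], dtype=float)

ids, pruned = assign_patterns(kernels, patterns)
bits = index_bits_per_kernel(len(patterns))
```

With four patterns, each kernel needs only a 2-bit ID, whereas an irregularly pruned kernel would need a position index for every retained weight. How PACA actually selects its pattern set and trades the pattern count against accuracy is described in the paper, not in this sketch.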