An Efficient CNN Accelerator Achieving High PE Utilization Using a Dense-/Sparse-Aware Redundancy Reduction Method and Data–Index Decoupling Workflow
Yishuo Meng,Chen Yang,Siwei Xiang,Jianfei Wang,Kuizhi Mei,Li Geng
DOI: https://doi.org/10.1109/tvlsi.2023.3298509
2023-01-01
IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Abstract: To handle complex scenes and strict accuracy requirements, convolutional neural networks (CNNs) continue to evolve. These evolutions change filter size, convolution type, and sparsity, and this diversity makes it difficult to deploy evolving CNNs on field-programmable gate array (FPGA)-based accelerators. This article proposes a dense-/sparse-aware CNN accelerator that achieves high processing element (PE) utilization and configurability. First, a filter-based decomposition and clustering algorithm (FDCA) is proposed to convert filters of various sizes into unified-size filters. In addition, a sparse-aware filter transformation scheme (SFTS) is presented to dynamically eliminate invalid weights for sparse filters and accelerate dense filters. Because these two techniques remove the dependency on sparsity, a hardware accelerator with a data–index decoupling workflow and an input channel schedule-distribution system is designed to take advantage of FDCA and SFTS. The proposed accelerator is implemented on a Xilinx ZCU102 platform at 300 MHz. The digital signal processor (DSP) efficiencies for dense AlexNet, unstructured sparse AlexNet, dense MobileNetV2, and structured sparse MobileNetV2 are 0.987, 2.025, 0.547, and 1.278 GOPS/DSP, respectively. Compared with previous dense- and sparse-based designs, the accelerator achieves up to a $4.263\times$ speedup in DSP efficiency.
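The idea of converting filters of various sizes into unified-size filters can be illustrated with a minimal sketch. This is not the paper's FDCA, whose details the abstract does not give. It assumes a single-channel, stride-1 convolution and a simple zero-padded split of a K×K filter into 3×3 tiles; all function names are hypothetical. Tiles that contain only zeros are skipped, which loosely mirrors the removal of invalid weights.

```python
import numpy as np

def conv2d_valid(x, w):
    """Plain 'valid' 2-D cross-correlation of x (H, W) with a square filter w."""
    k = w.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for u in range(k):
        for v in range(k):
            out += w[u, v] * x[u:u + out_h, v:v + out_w]
    return out

def decompose_filter(w, tile=3):
    """Split a KxK filter into tile x tile sub-filters (hypothetical helper).

    The filter is zero-padded up to the next multiple of `tile`.
    Returns a list of ((row_offset, col_offset), sub_filter) pairs,
    omitting sub-filters whose weights are all zero.
    """
    k = w.shape[0]
    kp = -(-k // tile) * tile  # K rounded up to a multiple of tile
    w_pad = np.zeros((kp, kp), dtype=w.dtype)
    w_pad[:k, :k] = w
    subs = []
    for r in range(0, kp, tile):
        for c in range(0, kp, tile):
            sub = w_pad[r:r + tile, c:c + tile]
            if np.any(sub):  # skip tiles with no valid (nonzero) weights
                subs.append(((r, c), sub))
    return subs

def conv2d_decomposed(x, w, tile=3):
    """Same result as conv2d_valid(x, w), computed only with tile x tile filters."""
    k = w.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    kp = -(-k // tile) * tile
    # Pad the input at the bottom/right so every shifted window is in range;
    # the padded pixels only ever meet zero-padded weights.
    x_pad = np.pad(x, ((0, kp - k), (0, kp - k)))
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for (r, c), sub in decompose_filter(w, tile):
        window = x_pad[r:r + out_h + tile - 1, c:c + out_w + tile - 1]
        out += conv2d_valid(window, sub)  # partial sums accumulate
    return out

# Usage: a 5x5 filter is served by 3x3 sub-filters with identical output.
rng = np.random.default_rng(0)
x = rng.integers(-5, 5, size=(8, 8))
w = rng.integers(-3, 3, size=(5, 5))
assert np.array_equal(conv2d_valid(x, w), conv2d_decomposed(x, w))
```

Integer inputs are used so the two results can be compared exactly. In this sketch the partial sums from the sub-filters are simply accumulated in software; how the actual accelerator schedules and clusters sub-filters is not specified in the abstract.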