An Efficient CNN Training Accelerator Leveraging Transposable Block Sparsity

Mingyang Xu, Jinming Lu, Zhongfeng Wang, Jun Lin
DOI: https://doi.org/10.1109/aicas54282.2022.9869938
2022-01-01
Abstract: Convolutional neural network (CNN) training is computationally intensive and requires a great deal of time and resources. Exploiting data sparsity is a promising way to accelerate CNN training. In this work, we propose a novel sparse training algorithm in which the weight matrices are pruned in a fine-grained, block-wise manner. The forward propagation (FP) and backward propagation (BP) phases use an identical data layout. This eliminates the matrix transposition step and reduces both storage space and training time. Based on this pruning approach, we developed an FPGA-based CNN training accelerator built on a systolic array. The architecture skips computations on zero values without causing load imbalance among processing elements (PEs). Experimental results show that our design achieves a computational throughput of 1.024 TOPS and an energy efficiency of 118.4 GOPS/W. It is 1.41× to 4.93× more energy efficient than state-of-the-art training accelerators.
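The abstract does not state the exact pruning rule. The sketch below assumes a transposable N:M pattern: every row and every column of each M×M weight block keeps exactly N weights, here 2 of 4. This scheme is swapped in for illustration and is not confirmed as the authors' method. The row and column constraints are symmetric, so W and W^T share the same sparsity structure. In principle, FP (reading W) and BP (reading W^T) could then both skip zeros from one stored layout, with no transposition step. The helper names `candidate_patterns`, `transposable_block_mask`, `transpose`, and `satisfies_n_of_m` are hypothetical.

```python
from itertools import combinations, product


def candidate_patterns(m, n):
    """All m x m 0/1 masks with exactly n ones in every row AND every column."""
    rows = [tuple(1 if j in keep else 0 for j in range(m))
            for keep in combinations(range(m), n)]
    return [choice for choice in product(rows, repeat=m)
            if all(sum(r[j] for r in choice) == n for j in range(m))]


def transposable_block_mask(weights, m=4, n=2):
    """Assumed scheme: per m x m block, pick the valid pattern keeping the most |w|."""
    rows, cols = len(weights), len(weights[0])
    assert rows % m == 0 and cols % m == 0, "matrix dims must be multiples of m"
    patterns = candidate_patterns(m, n)  # exhaustive search; feasible for small m
    mask = [[0] * cols for _ in range(rows)]
    for bi in range(0, rows, m):
        for bj in range(0, cols, m):
            def kept_magnitude(p):
                return sum(abs(weights[bi + r][bj + c]) * p[r][c]
                           for r in range(m) for c in range(m))
            best = max(patterns, key=kept_magnitude)
            for r in range(m):
                for c in range(m):
                    mask[bi + r][bj + c] = best[r][c]
    return mask


def transpose(a):
    return [list(col) for col in zip(*a)]


def satisfies_n_of_m(mask, m, n):
    """True if every m-wide row segment (one block row) has exactly n nonzeros."""
    return all(sum(row[j:j + m]) == n
               for row in mask for j in range(0, len(row), m))


# Usage: the same mask is N:M-sparse for W (FP) and for W^T (BP).
w = [[(i * 7 + j * 3) % 11 - 5 for j in range(8)] for i in range(8)]
mask = transposable_block_mask(w)
assert satisfies_n_of_m(mask, 4, 2)             # row-wise access, as in FP
assert satisfies_n_of_m(transpose(mask), 4, 2)  # transposed access, as in BP
```

Exhaustive search is only practical for small blocks. For M = 4 and N = 2 there are 90 valid patterns per block. Larger blocks would need a heuristic, and whether the accelerator described here uses such a search is not stated in the abstract.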