Improving the computational efficiency and flexibility of FPGA-based CNN accelerator through loop optimization

Yuhao Liu, Yanhua Ma, Bowei Zhang, Lu Liu, Jie Wang, Shibo Tang
DOI: https://doi.org/10.1016/j.mejo.2024.106197
IF: 1.992
2024-04-11
Microelectronics Journal
Abstract: The convolution operation consists of three-dimensional multiply-accumulate (MAC) operations within four loops, which leaves a large design space to optimize. Prior research did not thoroughly investigate loop optimization operations, so the resulting accelerators employed inefficient parallel computing architectures and consumed unnecessary resources. This study addresses the limitations of existing FPGA-based Convolutional Neural Network (CNN) accelerators in computational efficiency and flexibility by proposing a novel scalable accelerator architecture. We first define a design space that includes the loop optimization operations of loop tiling, loop interchange, and loop unrolling. Based on this, we explore a more efficient dataflow and accelerator architecture through a quantitative analysis of the trade-off between accelerator performance and hardware costs. We then show how searching this design space for the optimal loop optimization strategy can guide the design of the accelerator architecture towards the best achievable performance. The effectiveness of the proposed architecture is confirmed by implementing VGG-16, ResNet-50, and ResNet-152 on Xilinx ZCU102 and Xilinx ZCU111 FPGAs. The achieved peak throughputs for the three networks are 721.48 GOPS, 546.98 GOPS, and 664.66 GOPS, respectively, demonstrating high performance, efficient resource usage, and flexibility.
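The three loop optimizations named in the abstract can be illustrated with a small software sketch. The code below is not the paper's architecture or dataflow: the function names, the tile sizes `tile_out` and `tile_in`, and the chosen loop order are all assumptions made for illustration. It computes a stride-1, unpadded convolution twice, once with a plain loop nest and once with the channel loops tiled and interchanged so that the innermost loops are the ones a hardware design could unroll into parallel MAC units.

```python
def conv_reference(ifm, weights):
    """Plain convolution. ifm[c][y][x], weights[m][c][ky][kx]; stride 1, no padding."""
    n_in, height, width = len(ifm), len(ifm[0]), len(ifm[0][0])
    n_out, k = len(weights), len(weights[0][0])
    out_h, out_w = height - k + 1, width - k + 1
    ofm = [[[0] * out_w for _ in range(out_h)] for _ in range(n_out)]
    for m in range(n_out):                      # output channels
        for y in range(out_h):                  # output rows
            for x in range(out_w):              # output columns
                for c in range(n_in):           # input channels
                    for ky in range(k):         # kernel window
                        for kx in range(k):
                            ofm[m][y][x] += ifm[c][y + ky][x + kx] * weights[m][c][ky][kx]
    return ofm


def conv_tiled(ifm, weights, tile_out=2, tile_in=2):
    """Same result, with tiled and interchanged channel loops (illustrative only)."""
    n_in, height, width = len(ifm), len(ifm[0]), len(ifm[0][0])
    n_out, k = len(weights), len(weights[0][0])
    out_h, out_w = height - k + 1, width - k + 1
    ofm = [[[0] * out_w for _ in range(out_h)] for _ in range(n_out)]
    # Loop tiling: walk the channel dimensions in blocks that would fit on-chip buffers.
    for m0 in range(0, n_out, tile_out):
        for c0 in range(0, n_in, tile_in):
            for y in range(out_h):
                for x in range(out_w):
                    for ky in range(k):
                        for kx in range(k):
                            # Loop interchange: the channel loops are now innermost.
                            # Loop unrolling: in hardware, these two loops would map to
                            # up to tile_out * tile_in MAC units working in parallel.
                            for m in range(m0, min(m0 + tile_out, n_out)):
                                for c in range(c0, min(c0 + tile_in, n_in)):
                                    ofm[m][y][x] += (
                                        ifm[c][y + ky][x + kx] * weights[m][c][ky][kx]
                                    )
    return ofm


# Small integer test data: 3 input channels of 5x5, 4 output channels, 3x3 kernels.
ifm = [[[(c + y + x) % 5 for x in range(5)] for y in range(5)] for c in range(3)]
weights = [
    [[[(m + c + ky + kx) % 3 - 1 for kx in range(3)] for ky in range(3)] for c in range(3)]
    for m in range(4)
]
assert conv_tiled(ifm, weights) == conv_reference(ifm, weights)
```

Reordering and blocking the loops leaves the result unchanged; what changes is which data is reused between consecutive iterations and how many MACs can run at once, which is the kind of performance versus hardware cost trade-off the abstract describes.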
engineering, electrical & electronic; nanoscience & nanotechnology