Improving HW/SW Adaptability for Accelerating CNNs on FPGAs Through A Dynamic/Static Co-Reconfiguration Approach

Lei Gong, Chao Wang, Xi Li, Xuehai Zhou
DOI: https://doi.org/10.1109/tpds.2020.3046762
IF: 5.3
2020-01-01
IEEE Transactions on Parallel and Distributed Systems
Abstract: With the continuous evolution of Convolutional Neural Networks (CNNs) and the growing computing capability of FPGAs, FPGA-based CNN accelerators have become increasingly popular in a variety of computing scenarios. The key to implementing these accelerators is to take full advantage of the underlying hardware characteristics so that the hardware adapts to the computational features of the software-level CNN model. Previous designs, however, mainly rely on static hardware reconfiguration, which is not flexible enough to fully match the accelerator architecture to the CNN's features and therefore results in inefficient computation and data communication. By leveraging the dynamic partial reconfiguration technology available in modern FPGA devices, in this article we propose a new accelerator architecture for implementing CNNs on FPGAs in which the static and dynamic reconfigurability of the hardware are used cooperatively to maximize acceleration efficiency. Based on this architecture, we further present a systematic design and optimization methodology for implementing a specific CNN model in a particular computing scenario. The methodology comprises a static design space exploration method, which obtains the optimal static hardware configuration, and a reinforcement learning-based decision method, which obtains the optimal run-time reconfiguration strategy. We evaluate our proposal by implementing three widely used CNN models, AlexNet, VGG16C, and ResNet34, on the Xilinx ZCU102 FPGA platform. Experimental results show that, averaged over the three targeted CNN models, our implementations achieve 683 GOPS with 16-bit fixed-point data and 1.37 TOPS with 8-bit fixed-point data, and improve the computational density by 1.1× to 1.91× compared with previous implementations on the same type of FPGA platform.
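As a quick sanity check on the two average throughput figures in the abstract, the sketch below converts both to GOPS and computes their ratio. The helper name `to_gops` and the variable names are illustrative assumptions, not taken from the paper; only the two input numbers (683 GOPS and 1.37 TOPS) come from the abstract.

```python
# Illustrative arithmetic only: the helper and variable names are hypothetical.
# The two throughput values are the averages reported in the abstract.

UNIT_TO_GOPS = {"GOPS": 1.0, "TOPS": 1000.0}  # 1 TOPS = 1000 GOPS


def to_gops(value: float, unit: str) -> float:
    """Convert a throughput value in GOPS or TOPS to GOPS."""
    return value * UNIT_TO_GOPS[unit]


throughput_16bit = to_gops(683, "GOPS")   # 16-bit fixed-point average
throughput_8bit = to_gops(1.37, "TOPS")   # 8-bit fixed-point average

# Ratio of 8-bit to 16-bit average throughput.
ratio = throughput_8bit / throughput_16bit

print(f"16-bit: {throughput_16bit:.0f} GOPS")
print(f"8-bit:  {throughput_8bit:.0f} GOPS")
print(f"8-bit / 16-bit ratio: {ratio:.2f}")
```

The ratio comes out close to 2, so halving the data width roughly doubles the reported average throughput. The abstract itself does not state the reason for this scaling.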