A 200MHZ 202.4GFLOPS@10.8W VGG16 Accelerator in Xilinx VX690T.

Chunsheng Mei, Zhenyu Liu, Yue Niu, Xiangyang Ji, Wei Zhou, Dongsheng Wang
DOI: https://doi.org/10.1109/globalsip.2017.8309067
2017-01-01
Abstract: Convolutional Neural Networks (CNNs) are among the most powerful and widely used algorithms for computer vision applications, despite their computation-demanding and memory-intensive operations. The cost of CNN inference stems from the heavy cross-channel computation of convolutional (CONV) layers and the massive parameter retrieval of fully-connected (FC) layers, respectively. In this paper, to remove the inter-filter redundancy, we constructed and tuned specific low-rank filters in the fully-connected layers. The proposed rank reduction saves 88.9% of both the arithmetic operations and the parameters of the fully-connected layers in the VGG16 model. In addition, by employing a network-layer-wise ping-pong DDR access mode, tile-grain on-chip feature map buffers, and a Propagate Partial Multiply-Accumulate (PPMAC) processor, we implemented a 202.4 GFLOPS CNN accelerator with a half-precision data format on the Xilinx VC709 evaluation board. Experiments show that the accelerator achieved a throughput of 6.58 fps with 0.7046 top-1 accuracy and 0.8977 top-5 accuracy at a working frequency of 200 MHz.
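To make the rank-reduction claim concrete, the following is a minimal sketch of how parameter and multiply-accumulate savings scale with rank. It assumes the common two-factor form, in which a dense n_in x n_out weight matrix is replaced by an n_in x r matrix followed by an r x n_out matrix; the abstract does not state the exact factorization or the ranks used, so the function names, the factorization form, and the example rank are illustrative assumptions rather than the authors' method. The layer shapes are those of the standard VGG16 FC layers.

```python
# Hypothetical helpers: the two-factor form W ~= U @ V is an assumption,
# not a scheme confirmed by the abstract.

def low_rank_savings(n_in, n_out, rank):
    """Fraction of parameters (and MACs) saved by factoring a dense
    n_in x n_out layer into n_in x rank and rank x n_out layers."""
    dense = n_in * n_out
    factored = rank * (n_in + n_out)
    return 1.0 - factored / dense


def max_rank_for_savings(n_in, n_out, target):
    """Largest rank whose savings still reach the target fraction."""
    return int((1.0 - target) * n_in * n_out // (n_in + n_out))


# Standard VGG16 fully-connected layer shapes (inputs, outputs).
VGG16_FC = {"fc6": (25088, 4096), "fc7": (4096, 4096), "fc8": (4096, 1000)}

if __name__ == "__main__":
    for name, (n_in, n_out) in VGG16_FC.items():
        # Rank needed in this layer alone to reach the 88.9% figure.
        r = max_rank_for_savings(n_in, n_out, 0.889)
        print(name, r, round(low_rank_savings(n_in, n_out, r), 4))
```

Because one multiply-accumulate is performed per weight in a fully-connected layer, the same fraction applies to both arithmetic and parameters under this assumption, which is consistent with the abstract quoting a single percentage for both.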