Evaluating Low-Memory GEMMs for Convolutional Neural Network Inference on FPGAs

Wentai Zhang, Ming Jiang, Guojie Luo
DOI: https://doi.org/10.1109/FCCM48280.2020.00013
2020-01-01
Abstract: FPGAs are becoming significant for implementing low-latency convolutional neural networks because of performance demands and power constraints. Conventional implementations of convolutional layers usually use direct convolution, which involves nested loops over channels, feature maps, and filters. Explicit general matrix multiplications (GEMMs) cost extra memory space, and the limited on-chip RAMs prevent an efficient GEMM-based implementation. In this paper, we evaluate a low-memory GEMM method on FPGA-based systolic arrays. We design a novel accelerator to save bandwidth and increase parallelism. We evaluate our design on MobileNet V1 and Inception V4. Our implementation achieves a throughput of around 3.5 TOP/s for both models. We also reduce memory usage by 21% compared to an explicit GEMM implementation for MobileNet V1 and by 44% for Inception V4.
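To make the memory argument in the abstract concrete, the following minimal sketch contrasts direct convolution (nested loops) with an explicit GEMM built from an im2col patch matrix. It does not reproduce the paper's low-memory method or its accelerator. The function names, the tensor layouts, and the stride-1, no-padding setting are all assumptions chosen for illustration.

```python
# Illustrative sketch only: names, layouts, stride 1 and no padding are assumptions.

def direct_conv(x, w):
    """Direct convolution. x: [C][H][W], w: [K][C][R][S] -> [K][P][Q]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K, R, S = len(w), len(w[0][0]), len(w[0][0][0])
    P, Q = H - R + 1, W - S + 1
    out = [[[0] * Q for _ in range(P)] for _ in range(K)]
    for k in range(K):                      # filters
        for p in range(P):                  # output rows
            for q in range(Q):              # output columns
                acc = 0
                for c in range(C):          # input channels
                    for r in range(R):
                        for s in range(S):
                            acc += x[c][p + r][q + s] * w[k][c][r][s]
                out[k][p][q] = acc
    return out

def im2col(x, R, S):
    """Materialize the explicit (C*R*S) x (P*Q) patch matrix."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    P, Q = H - R + 1, W - S + 1
    return [[x[c][p + r][q + s] for p in range(P) for q in range(Q)]
            for c in range(C) for r in range(R) for s in range(S)]

def gemm_conv(x, w):
    """Explicit GEMM: (K x C*R*S) weight matrix times the im2col matrix."""
    K, C, R, S = len(w), len(w[0]), len(w[0][0]), len(w[0][0][0])
    P, Q = len(x[0]) - R + 1, len(x[0][0]) - S + 1
    cols = im2col(x, R, S)
    wmat = [[w[k][c][r][s] for c in range(C) for r in range(R) for s in range(S)]
            for k in range(K)]
    flat = [[sum(a * b for a, b in zip(row, col)) for col in zip(*cols)]
            for row in wmat]
    # Reshape each flat output row of length P*Q back to P x Q.
    return [[flat[k][p * Q:(p + 1) * Q] for p in range(P)] for k in range(K)]

# Small deterministic example: C=2, H=W=4, K=3, R=S=3.
x = [[[c * 16 + h * 4 + v for v in range(4)] for h in range(4)] for c in range(2)]
w = [[[[(k + c + r + s) % 3 - 1 for s in range(3)] for r in range(3)]
      for c in range(2)] for k in range(3)]

input_elems = 2 * 4 * 4                      # elements in the original input
cols = im2col(x, 3, 3)
im2col_elems = len(cols) * len(cols[0])      # elements in the explicit patch matrix
same = direct_conv(x, w) == gemm_conv(x, w)
```

In this toy setting the explicit patch matrix holds 72 elements while the input holds 32, because overlapping windows are duplicated. That duplication is the extra memory cost of explicit GEMMs that the abstract refers to.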