SSiMD: Supporting Six Signed Multiplications in a DSP Block for Low-Precision CNN on FPGAs

Qi Liu, Mo Sun, Jie Sun, Liqiang Lu, Jieru Zhao, Zeke Wang
DOI: https://doi.org/10.1109/ICFPT59805.2023.00023
2023-01-01
Abstract: Low-precision CNN inference is widely deployed on FPGAs for edge applications such as target detection and graph classification, where convolution is the most time-consuming kernel because it requires a massive number of multiplications. Existing work shows that quantized low-precision convolution kernels work well for CNN inference; typically, 4-bit precision is enough to preserve inference accuracy. When deploying 4-bit convolution kernels on FPGAs, the state-of-the-art approach TiNNA implements three 4-bit multiplications within a DSP block, e.g., 27x18 DSP48E1, for high hardware efficiency. In particular, TiNNA places three 4-bit numbers on the 27-bit operand and one 4-bit number on the 18-bit operand of a DSP block. However, packing more multiplications into a DSP block introduces severe interference between them, especially when any operand is signed. We observe, surprisingly, that with careful design convolution can take advantage of this interference. To this end, we propose SSiMD, a DSP-efficient low-precision multiplication mechanism that enables six 4-bit multiplications in a DSP block for low-precision convolution kernels. SSiMD consists of two innovations. The first is a hardware-friendly signed-number transformation, and the second is to efficiently apply this transformation to the DSP block mapping, enabling six unsigned multiplications within a DSP block so as to fully leverage the interference. In addition, we optimize data transmission in the system to make it faster and to save resources. Experiments show that, compared to TiNNA, SSiMD achieves 1.32x throughput while consuming 0.66x DSPs, 0.66x BRAMs, and 0.98x LUTs.
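The operand packing and the interference described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the SSiMD or TiNNA mapping: the 8-bit field spacing, the function names, and the use of Python integers in place of a DSP multiplier are all hypothetical choices made for illustration.

```python
FIELD = 8  # assumed field spacing: a 4-bit x 4-bit product fits in 8 bits


def pack(values, field=FIELD):
    """Place each value in its own field of one wide operand (sum of shifted values)."""
    return sum(v << (i * field) for i, v in enumerate(values))


def fields(product, count, field=FIELD):
    """Split a wide product back into `count` raw fields."""
    mask = (1 << field) - 1
    return [(product >> (i * field)) & mask for i in range(count)]


def packed_multiply(a_values, b):
    """Unsigned case: one wide multiplication yields all products a_i * b."""
    assert all(0 <= a < 16 for a in a_values) and 0 <= b < 16
    return fields(pack(a_values) * b, len(a_values))


def naive_signed_fields(a_values, b, field=FIELD):
    """Signed case, read naively: each field is interpreted as a signed number.

    A negative partial product borrows from the field above it,
    so the neighbouring result is corrupted (the interference).
    """
    raw = fields(pack(a_values) * b, len(a_values), field)
    half = 1 << (field - 1)
    return [r - (1 << field) if r >= half else r for r in raw]


print(packed_multiply([3, 15, 9], 7))       # three unsigned products from one multiply
print(naive_signed_fields([-3, 5, 2], 7))   # middle field is off by one
```

In the unsigned call the three products are recovered exactly because each stays inside its own field. In the signed call the true products are -21, 35, and 14, but the borrow from the negative low field turns the middle result into 34, which is the kind of cross-field interference a signed packing scheme has to handle.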