A FPGA Embedded DSP Supporting Parallel Multiple Low Bit-Width Multiply-Accumulate Operations
Miao Wang, Zhihong Huang, Guowei Cai, Junxuan Wang
DOI: https://doi.org/10.3233/atde230112
2023-01-01
Abstract: With the continued development of big data and hardware computing platforms, deep learning is now widely applied in many intelligent scenarios. Recent studies have shown that using low bit-width networks for deep learning inference can effectively improve the overall performance of an accelerator by reducing its computational requirements while maintaining recognition accuracy. In particular, low bit-width convolution operations, such as 8-bit and 4-bit, are widely used in applications such as graph recognition. The FPGA is a core device in digital systems and, owing to its excellent reconfigurability, has become one of the mainstream platforms for deep learning accelerators. To accommodate diverse computing applications, the DSP modules in current mainstream FPGAs are built from higher bit-width multipliers. When these DSP modules perform low bit-width convolution operations, only part of the multiplier bit-width is occupied, which wastes a large amount of on-chip hardware resources. This paper therefore proposes a DSP architecture that uses large bit-width multipliers to compute low bit-width multiplications in parallel, so that the new DSP can realize double 8-bit and 4-bit multiply-accumulate operations without adding multipliers and can support any combination of signed and unsigned operands. The design is based on the commercial Stratix IV DSP architecture, and the overall circuit is implemented in the SMIC 14 nm standard CMOS process. Experimental results show that, when computing the same number of 4-bit and 8-bit multiply-accumulate operations, the improved DSP reduces resource area consumption by 43.5% and increases speed by 48%.
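The general idea of obtaining several low bit-width products from one wide multiplier can be illustrated in software. The sketch below is not the circuit proposed in the paper; it shows the generic operand-packing trick, assuming two signed 8-bit values that share a common multiplicand. The names `packed_dual_multiply` and `to_signed` and the 18-bit offset `SHIFT` are hypothetical choices made for this illustration only.

```python
# Illustrative sketch only: generic operand packing, not the paper's DSP circuit.
# SHIFT is a hypothetical offset; 18 bits leaves room for a signed 8x8 product.
SHIFT = 18


def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def packed_dual_multiply(a, b, c):
    """Return (a*c, b*c) for signed 8-bit a, b, c using one wide multiplication."""
    for v in (a, b, c):
        assert -128 <= v <= 127, "operands must fit in signed 8 bits"
    packed = (a << SHIFT) + b          # place a and b in one wide operand
    product = packed * c               # the single wide multiplication
    low = to_signed(product, SHIFT)    # low field holds b*c
    high = (product - low) >> SHIFT    # remove the low field's borrow, then shift
    return high, low


if __name__ == "__main__":
    print(packed_dual_multiply(-3, 5, -7))
```

Subtracting the low field before shifting corrects for the borrow that a negative `b*c` takes from the high field, which is why signed operands need slightly more care than unsigned ones.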