A Low-Cost Floating-Point FMA Unit Supporting Package Operations for HPC-AI Applications
Hongbing Tan, Jing Zhang, Libo Huang, Xiaowei He, Yongwen Wang, Liquan Xiao
DOI: https://doi.org/10.1109/tcsii.2024.3359678
2024-01-01
Abstract: The convergence of HPC and AI has diversified the floating-point precisions that hardware must support, which poses significant implementation challenges. This paper addresses the issue by presenting a low-cost floating-point (FP) fused multiply-add (FMA) unit that supports a wide range of FP formats. For formats narrower than 64 bits (SP, TF32, BF16, and HP), the unit performs standard or mixed-precision operations in parallel and fully pipelined. For the 64-bit DP format, FMA and ADD operations, whether independent or data-dependent, can be organized into package operations that execute in two consecutive cycles, which eliminates pipeline stalls and thereby improves performance. The proposed FMA unit uses iteration and hardware vectorization to balance cost against performance. Compared to a conventional DP FMA unit, the proposed design supports a wider range of FP formats and functions and achieves higher performance at lower cost. It improves performance by up to 1.5x over the dual-mode FMA unit when running HPC-AI applications.
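As background for the formats and the fused operation named in the abstract, the following is a minimal software sketch, not the paper's hardware design. The field widths are the commonly used ones for each format; the helper names (`FORMATS`, `total_bits`, `fma`, `mul_then_add`) are hypothetical and chosen only for illustration. The sketch shows why a fused multiply-add, which rounds once, can differ from a separate multiply followed by an add, which rounds twice.

```python
from fractions import Fraction

# (exponent bits, fraction bits) for each format; every format has 1 sign bit.
# These are the commonly cited layouts, not values taken from the paper.
FORMATS = {
    "DP": (11, 52),    # IEEE 754 binary64
    "SP": (8, 23),     # IEEE 754 binary32
    "TF32": (8, 10),   # 19 significant bits, usually held in a 32-bit container
    "BF16": (8, 7),
    "HP": (5, 10),     # IEEE 754 binary16
}


def total_bits(name):
    """Return sign + exponent + fraction bits for a format."""
    exponent_bits, fraction_bits = FORMATS[name]
    return 1 + exponent_bits + fraction_bits


def fma(a, b, c):
    """Fused a*b + c in DP: compute exactly, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def mul_then_add(a, b, c):
    """Unfused a*b + c in DP: the product is rounded before the add."""
    return a * b + c


if __name__ == "__main__":
    for name in FORMATS:
        print(name, total_bits(name))

    # The exact product is 1 - 2**-60, which is not representable in DP.
    a = 1.0 + 2.0 ** -30
    b = 1.0 - 2.0 ** -30
    c = -1.0
    print("fused:  ", fma(a, b, c))
    print("unfused:", mul_then_add(a, b, c))
```

In this example the unfused path rounds the product to 1.0 and loses the small term entirely, while the fused path keeps it. A hardware FMA unit obtains the same single-rounding behavior with a wide internal datapath rather than with exact rational arithmetic.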
engineering, electrical & electronic