An 8.8 TFLOPS/W Floating-Point RRAM-Based Compute-in-Memory Macro Using Low Latency Triangle-Style Mantissa Multiplication

Xianwu Hu, Yu Wang, Zizhao Ma, Gan Wen, Zeming Wang, Zhichao Lu, Yunlong Liu, Yanlei Li, Xingdong Liang, Xiaoyang Zeng, Yufeng Xie
DOI: https://doi.org/10.1109/tcsii.2023.3283418
2023-01-01
IEEE Transactions on Circuits & Systems II Express Briefs
Abstract: High-precision computation with low latency and high energy efficiency is required for AI-driven applications and scientific computing. Emerging compute-in-memory (CIM) technology shows great potential to accelerate the multiply-and-accumulate (MAC) operations that are frequently executed in such scenarios. Resistive RAM (RRAM) is highly suitable for CIM because of its nonvolatility, small cell size, and MAC-friendly structure. However, existing RRAM CIM designs focus on accelerating fixed-point/integer operations. Several works adopt a logic-CIM structure to support high-precision floating-point (FP) calculations, but they require many cycles and a large area to perform an FP operation. To meet the need for low latency and high energy efficiency in widely used FP calculations, we propose an accelerated FP-MAC architecture based on a 40 nm RRAM CIM array. A fully parallel data input scheme and a triangle weight arrangement are proposed for low-latency multi-bit multiplication. A non-uniformly grouped sense amplifier (NUGSA) array is adopted to save energy and area. Experiments show that the proposed FP-MAC design achieves an energy efficiency of up to 8.8 TFLOPS/W in FP8 mode and 3.3 TFLOPS/W in bFP16 mode, with a computing latency of 3.34 ns.
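To put the reported efficiency figures in more familiar units, the short sketch below converts TFLOPS/W into energy per floating-point operation. It relies only on the unit identity 1 TFLOPS/W = 10^12 FLOP per joule; the function name is illustrative, and how the paper counts a FLOP (for example, whether one MAC counts as one or two operations) is not stated in the abstract, so the results should be read as rough per-operation figures.

```python
def energy_per_flop_pj(tflops_per_watt: float) -> float:
    """Convert an efficiency in TFLOPS/W to energy per operation in pJ/FLOP.

    1 TFLOPS/W equals 1e12 FLOP per joule, so the energy per FLOP is
    1 / (x * 1e12) joules, which is (1 / x) picojoules.
    """
    if tflops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    return 1.0 / tflops_per_watt


# Efficiency figures quoted in the abstract.
fp8_pj = energy_per_flop_pj(8.8)    # about 0.114 pJ per FLOP
bfp16_pj = energy_per_flop_pj(3.3)  # about 0.303 pJ per FLOP

print(f"FP8:   {fp8_pj:.3f} pJ/FLOP")
print(f"bFP16: {bfp16_pj:.3f} pJ/FLOP")
```

Under this conversion, the FP8 mode corresponds to roughly 0.11 pJ per operation and the bFP16 mode to roughly 0.30 pJ per operation.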