A 28nm 4170-TFLOPS/W/b and 195-TFLOPS/mm²/b Multiply-Free Fully-Digital Floating-Point Compute-In-Memory Macro with Mitchell's Approximation

Ruiqi Guo, Xiaofeng Chen, Lei Wang, Fengbin Tu, Shaojun Wei, Yang Hu, Shouyi Yin
DOI: https://doi.org/10.1109/vlsitechnologyandcir46783.2024.10631459
2024-01-01
Abstract: This work presents an SRAM-based digital-domain compute-in-memory (DCIM) macro with three contributions: 1) a floating-point (FP) DCIM structure that converts bit-serial multiplication into bit-parallel addition using Mitchell Approximate Multiply (MAM); 2) a Mitchell MAC unit (MMACU) with an optimized full adder, adder tree, and 2's complement (2C) unit to achieve higher area efficiency; 3) a DCIM bank with latches that saves dynamic power by exploiting sparse MAM results. The fabricated 28-nm DCIM achieves $65.15\text{ TFLOPS}/\mathrm{W}$ energy efficiency and $304\text{ TFLOPS}/\text{mm}^{2}$ area efficiency for BF16; normalized per bit, this is $4170\text{ TFLOPS}/\mathrm{W}/\mathrm{b}$ and $195\text{ TFLOPS}/\text{mm}^{2}/\mathrm{b}$.
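
The idea behind replacing multiplication with addition can be illustrated in software. The sketch below applies the classic Mitchell logarithmic approximation, $\log_2(1+x) \approx x$ for $0 \le x < 1$, to two floating-point operands: the exponents are added, the mantissa fractions are added, and a carry out of the fraction sum bumps the exponent. It is only a behavioral illustration under stated assumptions: it operates on Python doubles rather than BF16, the function name `mitchell_multiply` is hypothetical, and it does not model the macro's bit-parallel datapath, MMACU, or latch-based bank described in the abstract.

```python
import math


def mitchell_multiply(a: float, b: float) -> float:
    """Approximate a * b with Mitchell's method (no mantissa multiplication).

    Hypothetical helper for illustration; assumes finite inputs.
    """
    if a == 0.0 or b == 0.0:
        return 0.0
    sign = -1.0 if (a < 0) != (b < 0) else 1.0

    # math.frexp gives |v| = m * 2**e with m in [0.5, 1).
    ma, ea = math.frexp(abs(a))
    mb, eb = math.frexp(abs(b))

    # Renormalize to the hardware-style form |v| = (1 + x) * 2**k, x in [0, 1).
    xa, ka = 2.0 * ma - 1.0, ea - 1
    xb, kb = 2.0 * mb - 1.0, eb - 1

    # log2(|a|) + log2(|b|) is approximated by (ka + kb) + (xa + xb):
    # both the exponent path and the fraction path are plain additions.
    k = ka + kb
    frac = xa + xb

    if frac < 1.0:
        mantissa = 1.0 + frac      # antilog: 2**k * (1 + frac)
    else:
        mantissa = frac            # antilog: 2**(k + 1) * (1 + (frac - 1))
        k += 1
    return sign * math.ldexp(mantissa, k)
```

For operands that are exact powers of two the result is exact, for example `mitchell_multiply(2.0, 4.0)` gives `8.0`. In general the approximation underestimates the true product, with a worst-case relative error of about 11.1% (reached for instance at `3.0 * 3.0`, which yields `8.0`).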