A 28nm 128TFLOPS/W Computing-In-Memory Engine Supporting One-Shot Floating-Point NN Inference and On-Device Fine-Tuning for Edge AI

Haikang Diao, Haoyang Luo, Jiahao Song, Bocheng Xu, Runsheng Wang, Yuan Wang, Xiyuan Tang
DOI: https://doi.org/10.1109/cicc60959.2024.10528985
2024-01-01
Abstract: Recent research has extended computing-in-memory (CIM) to floating-point (FP) operations, enabling high-precision computation for complex edge tasks such as object detection and segmentation [1]–[3]. However, ever-growing edge intelligence has escalated the need for higher throughput, better energy efficiency, and on-device updates, imposing significant challenges on prior pre-aligning-based FP CIMs (Fig. 1). 1) A fundamental limitation exists in the INT mantissa multiply-accumulate (MAC): bit-parallel computation is fast but consumes significant area and energy due to wide multipliers and adder trees, so most designs adopt a bit-serial compute scheme. Bit-serial computation, however, requires multiple compute cycles; for example, a BF16 mantissa MAC requires 8 cycles, severely limiting throughput. 2) The exponent sorting and mantissa normalization steps of FP/INT conversion in previous FP CIMs introduce a complex comparison tree and shifter, greatly increasing the area and energy overhead. 3) Previous FP CIMs do not support on-device fine-tuning for environment changes, resulting in accuracy loss in real-world applications.
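To make the pre-aligning scheme criticized above concrete, the following is a minimal software sketch of it, not the circuit proposed in the paper. It assumes an 8-bit significand (the 7 stored BF16 mantissa bits plus the hidden bit), truncating alignment shifts, and sign-magnitude activations; the function names (`decompose`, `prealign`, `bit_serial_mac`, `fp_dot`) are hypothetical and chosen only for illustration. It shows where the exponent comparison, the alignment shifter, and the 8 bit-serial cycles come from.

```python
import math

MANT_BITS = 8  # assumed BF16 significand width: 7 stored bits + 1 hidden bit


def decompose(x):
    """Split x into (sign, exponent, integer mantissa), so that
    x ~= sign * mant * 2**exponent with mant truncated to MANT_BITS bits."""
    if x == 0.0:
        return 0, 0, 0
    m, e = math.frexp(abs(x))             # abs(x) = m * 2**e, 0.5 <= m < 1
    mant = int(m * (1 << MANT_BITS))      # truncate to 8 bits (128..255)
    return (-1 if x < 0 else 1), e - MANT_BITS, mant


def prealign(values):
    """FP-to-INT conversion: find the largest exponent (the comparison tree)
    and right-shift every mantissa to that shared exponent (the shifter)."""
    parts = [decompose(v) for v in values]
    exps = [e for _, e, m in parts if m != 0]
    e_max = max(exps) if exps else 0
    aligned = [s * (m >> (e_max - e)) if m else 0 for s, e, m in parts]
    return aligned, e_max


def bit_serial_mac(acts, weights, bits=MANT_BITS):
    """Integer MAC that consumes one activation bit per cycle."""
    acc = 0
    cycles = 0
    for b in range(bits):
        # One cycle: add every weight whose activation has bit b set.
        partial = sum(
            w * (1 if a >= 0 else -1) * ((abs(a) >> b) & 1)
            for a, w in zip(acts, weights)
        )
        acc += partial << b               # weight the partial sum by 2**b
        cycles += 1
    return acc, cycles


def fp_dot(xs, ws):
    """Pre-aligned FP dot product: align, integer MAC, rescale."""
    a, ea = prealign(xs)
    w, ew = prealign(ws)
    acc, cycles = bit_serial_mac(a, w)
    return math.ldexp(acc, ea + ew), cycles
```

With inputs that are exactly representable, `fp_dot([1.5, -2.0, 0.25], [2.0, 0.5, 4.0])` returns `(3.0, 8)`: the result matches the exact dot product, and the cycle count equals the significand width, which is the throughput limit named in point 1. Operands with widely differing exponents lose low-order bits in the alignment shift, which this sketch truncates.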