Abstract: Computing-in-memory (CIM) chips have demonstrated high energy efficiency on multiply–accumulate (MAC) operations for artificial intelligence (AI) applications. Although integer (INT) CIM chips are emerging, floating-point (FP) CIM chips have not been well explored. Larger models and more complex tasks demand the higher accuracy of FP computation, and most neural network (NN) training tasks still rely on it. This work presents an energy-efficient FP CIM processor. It is observed that most of the exponent values of FP data are concentrated in a small region. Therefore, the FP computations are divided into intensive and sparse parts and then executed on an intensive-CIM sparse-digital architecture. First, an FP-to-INT CIM workflow is designed for the intensive FP operations to reduce the CIM execution cycles. Second, a flexible sparse-digital core is proposed for the remaining sparse FP operations. By using both the intensive-CIM and sparse-digital cores, this work achieves high energy efficiency with accuracy identical to the FP algorithm baseline. Considering the FP CIM execution flow, a CIM-friendly low-bit FP training method is proposed to further reduce the execution cycles. In addition, a low-MAC-value (MACV) CIM macro is designed to exploit the more random sparsity introduced by FP alignment. The chip, fabricated in 28 nm, shows a macro energy efficiency of 275–1615 TOPS/W at INT4 and 17.2–91.3 TOPS/W at FP16, ranging from dense operation to the average sparsity of the tested models.
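
The intensive/sparse split described in the abstract can be illustrated in software. The following is a minimal sketch under stated assumptions, not the chip's actual dataflow: the names `quantize`, `pick_window`, `in_window`, and `split_dot`, the 11-bit significand, and the 4-exponent window are all hypothetical choices made for illustration. Operands whose exponents fall inside the most populated window are aligned to a shared exponent and accumulated as integers (standing in for the intensive-CIM path), while the remaining outliers are multiplied in ordinary floating point (standing in for the sparse-digital path).

```python
import math
from collections import Counter

MANT_BITS = 11  # assumed significand width (FP16-like, hidden bit included)
WINDOW = 4      # assumed number of adjacent exponents served by the integer path


def quantize(x):
    """Round x to MANT_BITS significand bits (stand-in for a low-bit FP format)."""
    m, e = math.frexp(x)  # x == m * 2**e with 0.5 <= |m| < 1 (m == 0 for x == 0)
    return math.ldexp(float(round(m * 2 ** MANT_BITS)), e - MANT_BITS)


def pick_window(values):
    """Lowest exponent of the WINDOW-wide exponent range that covers the most values."""
    counts = Counter(math.frexp(v)[1] for v in values if v != 0.0)
    if not counts:
        return 0
    candidates = range(min(counts), max(counts) + 1)
    return max(candidates, key=lambda lo: sum(counts[e] for e in range(lo, lo + WINDOW)))


def in_window(v, lo):
    """Zero always fits; otherwise the exponent must lie in [lo, lo + WINDOW)."""
    return v == 0.0 or lo <= math.frexp(v)[1] < lo + WINDOW


def split_dot(weights, inputs):
    """Dot product split into an integer (intensive) part and an FP (sparse) part."""
    w = [quantize(v) for v in weights]
    x = [quantize(v) for v in inputs]
    lo_w, lo_x = pick_window(w), pick_window(x)
    int_acc, sparse_acc, n_int = 0, 0.0, 0
    for a, b in zip(w, x):
        if in_window(a, lo_w) and in_window(b, lo_x):
            # Aligning to the window's lowest exponent yields exact integers.
            a_int = int(math.ldexp(a, MANT_BITS - lo_w))
            b_int = int(math.ldexp(b, MANT_BITS - lo_x))
            int_acc += a_int * b_int
            n_int += 1
        else:
            # Outlier exponents are handled by a plain FP multiply-accumulate.
            sparse_acc += a * b
    # Undo the two alignment shifts once, after the integer accumulation.
    dense = math.ldexp(float(int_acc), lo_w + lo_x - 2 * MANT_BITS)
    return dense + sparse_acc, n_int, len(w) - n_int
```

Calling `split_dot(weights, inputs)` returns the dot product together with the number of products handled by each path. Because the in-window operands are aligned without dropping bits, the integer path in this sketch is exact with respect to the quantized FP inputs, which mirrors the abstract's claim of accuracy identical to the FP baseline. The hardware cost of that exactness is an integer width that grows with the window size, which is why the window is kept small here.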

A Reconfigurable Floating-Point Compute-In-Memory with Analog Exponent Pre-Processes

A 28nm 314.6TFLOPS/W Reconfigurable Floating-Point Analog Compute-In-Memory Macro with Exponent Approximation and Two-Stage Sharing TD-ADC

A Robust 8-Bit Non-Volatile Computing-in-Memory Core for Low-Power Parallel MAC Operations

A Low-Power In-Memory Multiplication and Accumulation Array with Modified Radix-4 Input and Canonical Signed Digit Weights

A 1.97 TFLOPS/W Configurable SRAM-Based Floating-Point Computation-in-Memory Macro for Energy-Efficient AI Chips

A 28-nm Floating-Point Computing-in-Memory Processor Using Intensive-CIM Sparse-Digital Architecture

A 28nm 16.9-300TOPS/W Computing-in-Memory Processor Supporting Floating-Point NN Inference/Training with Intensive-CIM Sparse-Digital Architecture

A 19.7 TFLOPS/W Multiply-less Logarithmic Floating-Point CIM Architecture with Error-Reduced Compensated Approximate Adder

A 28nm 128TFLOPS/W Computing-In-Memory Engine Supporting One-Shot Floating-Point NN Inference and On-Device Fine-Tuning for Edge AI

GCFP-ACIM: A 40nm 4.74TFLOPS/W General Complex Float-Point Analog Compute-in-Memory with Adaptive Power-Saving for HDR Signal Processing Applications

An 8.8 TFLOPS/W Floating-Point RRAM-Based Compute-in-Memory Macro Using Low Latency Triangle-Style Mantissa Multiplication

ReDCIM: Reconfigurable Digital Computing-in-Memory Processor with Unified FP/INT Pipeline for Cloud AI Acceleration

A 28-nm 64-kb 31.6-TFLOPS/W Digital-Domain Floating-Point-Computing-Unit and Double-Bit 6T-SRAM Computing-in-Memory Macro for Floating-Point CNNs

AFPR-CIM: An Analog-Domain Floating-Point RRAM-based Compute-In-Memory Architecture with Dynamic Range Adaptive FP-ADC

A 28nm 4170-Tflops/w/b and 195-Tflops/mm2/b Multiply-Free Fully-Digital Floating-Point Compute-In-Memory Macro with Mitchell's Approximation

An Energy-Efficient Floating-Point Compute SRAM with Pipelined In-Memory Bit-Parallel Exponent and Bitwise Mantissa Processing

A High-Density and Reconfigurable SRAM-Based Digital Compute-In-Memory Macro for Low-Power AI Chips

A 28nm 29.2TFLOPS/W BF16 and 36.5TOPS/W INT8 Reconfigurable Digital CIM Processor with Unified FP/INT Pipeline and Bitwise In-Memory Booth Multiplication for Cloud Deep Learning Acceleration

A Fully Bit-Flexible Computation in Memory Macro Using Multi-Functional Computing Bit Cell and Embedded Input Sparsity Sensing

A 28nm 8Kb Reconfigurable SRAM Computing-In-Memory Macro With Input-Sparsity Optimized DTC for Multi-mode MAC Operations