Abstract: Recent SRAM-based computation-in-memory (CIM) macros perform mid-to-high-precision multiply-and-accumulate (MAC) operations with improved energy efficiency, but they do so using ultra-small or small memory capacities (0.4-8KB). Advanced CIM-based edge-AI chips, however, favor multiple mid/large-capacity SRAM-CIM macros with high input (IN) and weight (W) precision. Such macros reduce the frequency of data reloads from external DRAM and avoid the need for additional SRAM buffers or ultra-large on-chip weight buffers. Yet enlarging memory capacity and throughput increases both the parasitic delay on word lines (WLs) and bitlines (BLs) and the number of parallel computing elements. The results are longer compute latency (tAC), lower energy efficiency (EF), degraded signal margin, and larger data-pattern-dependent fluctuations in power consumption (see Fig. 16.3.1). In addition, recent SRAM-CIM macros tend to avoid in-lab SRAM cells with logic-based layouts. Instead, they favor foundry-provided compact-layout 8T cells [2], [3], [5] or 6T cells with local-computing cells (LCCs) [4], [6], which reduce cell-array area and ease manufacturing. This paper presents an SRAM-CIM structure with three features. (1) A segmented-BL charge-sharing (SBCS) scheme performs MAC operations with low energy consumption and a consistently high signal margin across MAC values (MACV). (2) A new LCC, the source-injection local-multiplication cell (SILMC), supports the SBCS scheme with a signal margin that stays consistent under transistor process variation. (3) A prioritized-hybrid ADC (Ph-ADC) keeps the area and power overhead of analog readout small. A 28nm 384kb SRAM-CIM macro was fabricated using a foundry compact 6T cell. It supports MAC operations with 16 accumulations of 8b inputs and 8b weights and produces a near-full-precision (20b) output. The macro achieves a 7.2ns tAC and an EF of 22.75TOPS/W for 8b MAC operations. Its figure of merit (FoM = IN-precision × W-precision × output-ratio × output-channel × EF/tAC) is 6× higher than that of prior work.
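The abstract describes the SBCS scheme only at a high level. The sketch below is a rough, idealized illustration of why charge sharing can give a signal margin that does not depend on the MAC value. It models N identical segment capacitors C tied to a shared node. The k segments whose local product is 1 start at VDD. The remaining segments, plus an optional parasitic capacitance C_par, start at 0V. Shorting them conserves charge, so V = k·C·VDD / (N·C + C_par). The helper names (`shared_voltage`, `step_margins`, `is_uniform`) and the equal-capacitor, ideal-switch assumptions are hypothetical and are not the authors' circuit. Segmentation details, SILMC behavior, process variation, and ADC loading are not modeled.

```python
import math


def shared_voltage(k: int, n: int, vdd: float = 1.0,
                   c_unit: float = 1.0, c_par: float = 0.0) -> float:
    """Ideal charge sharing (hypothetical model): k of n unit capacitors
    start at vdd; the other n - k and a parasitic c_par start at 0 V.
    After shorting, total charge is conserved on the total capacitance."""
    if not 0 <= k <= n:
        raise ValueError("k must be in [0, n]")
    q_total = k * c_unit * vdd        # charge stored before sharing
    c_total = n * c_unit + c_par      # capacitance after sharing
    return q_total / c_total


def step_margins(n: int, **kw) -> list[float]:
    """Voltage difference between adjacent MAC values k and k + 1."""
    v = [shared_voltage(k, n, **kw) for k in range(n + 1)]
    return [b - a for a, b in zip(v, v[1:])]


def is_uniform(margins: list[float], rel_tol: float = 1e-9) -> bool:
    """True if every step equals the first one (within float tolerance)."""
    return all(math.isclose(m, margins[0], rel_tol=rel_tol) for m in margins)


print(shared_voltage(8, 16))                         # 0.5
print(is_uniform(step_margins(16)))                  # True
print(is_uniform(step_margins(16, c_par=4.0)))       # True
```

In this ideal model, the step between adjacent MAC values is C·VDD/(N·C + C_par) for every k. A parasitic load therefore shrinks the margin uniformly, but it does not make the margin depend on MACV.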
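The 20b output width matches simple arithmetic. An 8b×8b product needs up to 16b, and accumulating 16 products adds log2(16) = 4b. The hypothetical helper `exact_mac_bits` below checks this by computing the exact range of the accumulated sum. The abstract does not say whether inputs and weights are signed or unsigned, so both cases are computed. The sketch covers the width of an exact digital sum only. It says nothing about how the macro's analog readout or Ph-ADC produces its near-full-precision output.

```python
def exact_mac_bits(in_bits: int, w_bits: int, n_acc: int, signed: bool) -> int:
    """Smallest output width (bits) that holds every possible sum of n_acc
    products of in_bits-bit inputs and w_bits-bit weights without overflow."""
    def operand_range(bits: int) -> tuple[int, int]:
        if signed:  # two's complement
            return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
        return 0, (1 << bits) - 1

    in_lo, in_hi = operand_range(in_bits)
    w_lo, w_hi = operand_range(w_bits)
    # The product of two intervals reaches its extremes at the interval corners.
    corners = [a * b for a in (in_lo, in_hi) for b in (w_lo, w_hi)]
    sum_lo, sum_hi = n_acc * min(corners), n_acc * max(corners)

    # Grow the output width until its representable range covers the sum range.
    width = 1
    while True:
        out_lo, out_hi = operand_range(width)
        if out_lo <= sum_lo and sum_hi <= out_hi:
            return width
        width += 1


print(exact_mac_bits(8, 8, 16, signed=False))  # 20
print(exact_mac_bits(8, 8, 16, signed=True))   # 20
```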
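The FoM quoted above is a product of precision, output, and efficiency terms divided by latency. The hypothetical function `cim_fom` below simply evaluates that expression. The abstract reports IN = 8b, W = 8b, EF = 22.75TOPS/W, and tAC = 7.2ns, but it does not give the output-ratio or output-channel values for this macro. Both are therefore set to 1 as placeholders, so the printed number is only the partial product of the reported terms and not the paper's FoM.

```python
def cim_fom(in_bits: int, w_bits: int, output_ratio: float,
            output_channels: int, ef_tops_per_w: float, tac_ns: float) -> float:
    """FoM = IN-precision x W-precision x output-ratio x output-channel x EF / tAC."""
    if tac_ns <= 0:
        raise ValueError("tAC must be positive")
    return in_bits * w_bits * output_ratio * output_channels * ef_tops_per_w / tac_ns


# Reported terms: 8b IN, 8b W, 22.75 TOPS/W, 7.2 ns.
# output_ratio and output_channels are placeholders (1), NOT values from the paper.
partial = cim_fom(8, 8, 1.0, 1, 22.75, 7.2)
print(f"{partial:.2f}")
```

The absolute value depends on unit conventions. A figure such as the reported 6× improvement is meaningful only as a ratio between designs evaluated with the same formula and units.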

A 1.97 TFLOPS/W Configurable SRAM-Based Floating-Point Computation-in-Memory Macro for Energy-Efficient AI Chips

A Robust 8-Bit Non-Volatile Computing-in-Memory Core for Low-Power Parallel MAC Operations

A Low-Power In-Memory Multiplication and Accumulation Array with Modified Radix-4 Input and Canonical Signed Digit Weights

In-Memory Multi-Bit Multiplication and Accumulation (MAC) Using FeFET for Energy Efficient IoT

An 8.8 TFLOPS/W Floating-Point RRAM-Based Compute-in-Memory Macro Using Low Latency Triangle-Style Mantissa Multiplication

A 28nm 314.6TFLOPS/W Reconfigurable Floating-Point Analog Compute-In-Memory Macro with Exponent Approximation and Two-Stage Sharing TD-ADC

A 28-nm 64-kb 31.6-TFLOPS/W Digital-Domain Floating-Point-Computing-Unit and Double-Bit 6T-SRAM Computing-in-Memory Macro for Floating-Point CNNs

A 19.7 TFLOPS/W Multiply-less Logarithmic Floating-Point CIM Architecture with Error-Reduced Compensated Approximate Adder

A 28-nm Floating-Point Computing-in-Memory Processor Using Intensive-CIM Sparse-Digital Architecture

A Reconfigurable Floating-Point Compute-In-Memory with Analog Exponent Pre-Processes

A High-Density and Reconfigurable SRAM-Based Digital Compute-In-Memory Macro for Low-Power AI Chips

A 28nm 4170-Tflops/w/b and 195-Tflops/mm2/b Multiply-Free Fully-Digital Floating-Point Compute-In-Memory Macro with Mitchell's Approximation

An Energy-Efficient Floating-Point Compute SRAM with Pipelined In-Memory Bit-Parallel Exponent and Bitwise Mantissa Processing

A 28nm 16.9-300TOPS/W Computing-in-Memory Processor Supporting Floating-Point NN Inference/Training with Intensive-CIM Sparse-Digital Architecture

A 28nm 32kb SRAM Computing-in-Memory Macro with Hierarchical Capacity Attenuator and Input Sparsity-Optimized ADC for 4b Mac Operation

A Twin-8T SRAM Computation-in-Memory Unit-Macro for Multibit CNN-Based AI Edge Processors

16.3 A 28nm 384kb 6T-SRAM Computation-in-Memory Macro with 8b Precision for AI Edge Chips

A 28nm 128TFLOPS/W Computing-In-Memory Engine Supporting One-Shot Floating-Point NN Inference and On-Device Fine-Tuning for Edge AI

An XOR-10T SRAM computing-in-memory macro with current MAC operations and time-to-digital conversion for BNN edge processors

AFPR-CIM: An Analog-Domain Floating-Point RRAM-based Compute-In-Memory Architecture with Dynamic Range Adaptive FP-ADC