A CMOS-integrated compute-in-memory macro based on resistive random-access memory for AI edge devices

Cheng-Xin Xue, Yen-Cheng Chiu, Ta-Wei Liu, Tsung-Yuan Huang, Je-Syu Liu, Ting-Wei Chang, Hui-Yao Kao, Jing-Hong Wang, Shih-Ying Wei, Chun-Ying Lee, Sheng-Po Huang, Je-Min Hung, Shih-Hsih Teng, Wei-Chen Wei, Yi-Ren Chen, Tzu-Hsiang Hsu, Yen-Kai Chen, Yun-Chen Lo, Tai-Hsing Wen, Chung-Chuan Lo, Ren-Shuo Liu, Chih-Cheng Hsieh, Kea-Tiong Tang, Mon-Shu Ho, Chin-Yi Su, Chung-Cheng Chou, Yu-Der Chih, Meng-Fan Chang

DOI: https://doi.org/10.1038/s41928-020-00505-5

IF: 33.255

2020-12-14

Nature Electronics

Abstract: Commercial complementary metal–oxide–semiconductor and resistive random-access memory technologies can be used to create multibit compute-in-memory circuits capable of fast and energy-efficient inference for use in small artificial intelligence edge devices.

engineering, electrical & electronic

What problem does this paper attempt to address?

The paper attempts to address the problem of achieving efficient, low-power artificial intelligence (AI) computation on edge devices. Specifically, it tackles a series of challenges that non-volatile compute-in-memory (nvCIM) architectures face when performing dot-product operations:

1. **Precision of input-weight-output configuration**: Existing nvCIM architectures offer insufficient precision when handling multi-bit inputs, weights, and outputs, which limits the complexity and inference accuracy of the neural networks they can run.
2. **Performance bottleneck**: Data transfer in traditional von Neumann architectures leads to high latency and high energy consumption, forming the so-called "memory wall" bottleneck.
3. **Parallel input and cell area limitations**: Large-scale parallel input and high-precision weight storage require more cell area, increasing design complexity and energy consumption.
4. **Signal margin degradation**: Current leakage in high resistance state (HRS) cells decreases the signal margin, affecting computational accuracy.
5. **Delay and energy consumption of multi-bit analog readout operations**: High-precision analog-to-digital conversion incurs longer delays and higher energy consumption.

To overcome these challenges, the paper proposes a 2 Mb resistive random-access memory (ReRAM) nvCIM macro that is fully integrated with complementary metal-oxide-semiconductor (CMOS) circuitry. It achieves higher input-output parallelism, a smaller cell array area, improved precision, and lower computational delay and energy consumption through the following techniques:

- **Bit-line input-output multi-bit computation (BLIOMC) scheme**: Uses a single word line and input-aware multi-bit bit-line clamping (IA-MBC) to reduce the dynamic range of the bit-line current, shorten input delay, and increase the number of parallel inputs.
- **Staggered binary complement weight mapping and biasing (S2CWMB) scheme**: Reduces area overhead and current consumption.
- **In-situ high resistance state current cancellation (HRS-C) scheme**: Improves signal margin and reduces energy consumption.
- **High resistance state first quantization (HRS-FQ) process**: Balances energy consumption and inference accuracy.
- **Dual-bit small offset current mode sense amplifier (DbSO-CSA)**: Shortens the delay and reduces the energy consumption of multi-bit readout operations.
- **Global replica local mixed reference current generation (GRLM-RCG) scheme**: Reduces the energy consumption of reference current generation.

Through these techniques, the proposed nvCIM macro achieves computing delays from 9.2 ns (binary configuration) to 18.3 ns (multi-bit configuration) and energy efficiencies from 146.21 (binary) to 36.61 (multi-bit) tera-operations per second per watt.
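The signal-margin problem in item 4, and why cancelling the HRS current helps, can be illustrated with a small numerical sketch. This is a toy model, not the circuit described in the paper: the current values `I_LRS` and `I_HRS`, the 16-row array, and the function names `bitline_current` and `read_mac` are all assumptions made for illustration. Each activated row is assumed to add a fixed current to the bit line depending on whether its cell stores 1 (low resistance state, LRS) or 0 (HRS).

```python
# Toy model of a binary dot product on one bit line (all values are assumed).
I_LRS = 10.0  # assumed current of an activated low-resistance (bit = 1) cell
I_HRS = 0.5   # assumed leakage current of an activated high-resistance (bit = 0) cell


def bitline_current(inputs, weights):
    """Total bit-line current: each activated row adds I_LRS if its stored
    bit is 1, or the leakage I_HRS if its stored bit is 0."""
    return sum(x * (I_LRS if w else I_HRS) for x, w in zip(inputs, weights))


def read_mac(inputs, weights, cancel_hrs):
    """Quantize the bit-line current back to an integer dot product."""
    current = bitline_current(inputs, weights)
    if cancel_hrs:
        # Subtract the leakage of every activated row; what remains is
        # (I_LRS - I_HRS) per activated LRS cell.
        current -= I_HRS * sum(inputs)
        return round(current / (I_LRS - I_HRS))
    # Without cancellation, accumulated HRS leakage is mistaken for signal.
    return round(current / I_LRS)


inputs = [1] * 16             # 16 rows activated in parallel
weights = [1] * 3 + [0] * 13  # 3 LRS cells, 13 HRS cells
exact = sum(x * w for x, w in zip(inputs, weights))

print("exact dot product:   ", exact)
print("without cancellation:", read_mac(inputs, weights, cancel_hrs=False))
print("with cancellation:   ", read_mac(inputs, weights, cancel_hrs=True))
```

In this model the leakage of the 13 activated HRS cells adds up to more than half of one LRS current, so the uncorrected readout rounds to the wrong level, while subtracting the leakage recovers the exact result. The more rows are activated in parallel, the larger the accumulated leakage, which is why the margin issue is tied to input parallelism.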

A Robust 8-Bit Non-Volatile Computing-in-Memory Core for Low-Power Parallel MAC Operations.

A CMOS-integrated spintronic compute-in-memory macro for secure AI edge devices

CMOS-integrated memristive non-volatile computing-in-memory for AI edge processors

In-Memory Multi-Bit Multiplication and Accumulation (MAC) Using FeFET for Energy Efficient IoT

15.5 A 28nm 64Kb 6T SRAM Computing-in-Memory Macro with 8b MAC Operation for AI Edge Chips

A 28nm Hybrid 2T1R RRAM Computing-in-Memory Macro for Energy-efficient AI Edge Inference

A Local Computing Cell and 6T SRAM-Based Computing-in-Memory Macro With 8-b MAC Operation for Edge AI Chips

Challenges and Trends of SRAM-Based Computing-In-Memory for AI Edge Devices

A High-Density and Reconfigurable SRAM-Based Digital Compute-In-Memory Macro for Low-Power AI Chips.

A Twin-8T SRAM Computation-in-Memory Unit-Macro for Multibit CNN-Based AI Edge Processors

A computing-in-memory macro based on three-dimensional resistive random-access memory

A 4-Kb 1-to-8-bit Configurable 6T SRAM-Based Computation-in-Memory Unit-Macro for CNN-Based AI Edge Processors

16.3 A 28nm 384kb 6T-SRAM Computation-in-Memory Macro with 8b Precision for AI Edge Chips

A 28-Nm RRAM Computing-in-Memory Macro Using Weighted Hybrid 2T1R Cell Array and Reference Subtracting Sense Amplifier for AI Edge Inference

An 8-Mb DC-Current-Free Binary-to-8b Precision ReRAM Nonvolatile Computing-in-Memory Macro using Time-Space-Readout with 1286.4-21.6TOPS/W for Edge-AI Devices

Fusion of memristor and digital compute-in-memory processing for energy-efficient edge computing

A 1.041-Mb/mm² 27.38-TOPS/W Signed-INT8 Dynamic-Logic-Based ADC-less SRAM Compute-in-Memory Macro in 28nm with Reconfigurable Bitwise Operation for AI and Embedded Applications

An ADC-less RRAM-based Computing-in-Memory Macro with Binary CNN for Efficient Edge AI

A Fully Integrated System‐on‐Chip Design with Scalable Resistive Random‐Access Memory Tile Design for Analog in‐Memory Computing

A 28 Nm RRAM-Based 81.1 TOPS/mm2/bit Compute-In-Memory Macro with Uniform and Linear 64 Read Channels under 512 4-Bit Inputs