Toward Energy Efficient STT-MRAM-based Near Memory Computing Architecture for Embedded Systems

Yueting Li, Xueyan Wang, He Zhang, Biao Pan, Keni Qiu, Wang Kang, Jun Wang, Weisheng Zhao

DOI: https://doi.org/10.1145/3650729

2024-03-07

ACM Transactions on Embedded Computing Systems

Abstract: Convolutional Neural Networks (CNNs) have significantly impacted embedded system applications across various domains. However, their adoption exacerbates the real-time processing and hardware resource constraints of embedded systems. To tackle these issues, we propose a spin-transfer torque magnetic random-access memory (STT-MRAM)-based near memory computing (NMC) design for embedded systems. We optimize this design in three respects. First, a fast-pipelined STT-MRAM readout scheme provides higher memory bandwidth for the NMC design, enhancing real-time processing capability with a non-trivial area overhead. Second, a direct index compression format, in conjunction with a digital sparse matrix-vector multiplication (SpMV) accelerator, supports the varied matrices found in practical applications and alleviates computing resource requirements. Third, custom NMC instructions and a stream converter for NMC systems dynamically adjust the available hardware resources for better utilization. Experimental results demonstrate that the memory bandwidth of STT-MRAM reaches 26.7 GB/s. The digital SpMV accelerator improves energy consumption and latency by up to 64x and 1120x, respectively, across matrices with sparsity ranging from 10% to 99.8%. Transmission of single-precision and double-precision elements increases by up to 8x and 9.6x, respectively. Furthermore, our design achieves a throughput of up to 15.9x over state-of-the-art designs.

computer science, software engineering, hardware & architecture

What problem does this paper attempt to address?

This paper addresses the real-time processing demands and hardware resource limitations of embedded systems that run CNN workloads. It proposes a near memory computing (NMC) design based on spin-transfer torque magnetic random-access memory (STT-MRAM) and optimizes it in three respects:

1. A fast-pipelined STT-MRAM readout scheme raises the memory bandwidth available to the NMC design and enhances real-time processing capability.

2. A direct index compression format, combined with a digital sparse matrix-vector multiplication (SpMV) accelerator, supports the matrices found in different application scenarios and reduces computing resource requirements.

3. Custom NMC instructions and a stream converter dynamically adjust the available hardware resources to improve utilization.

Experimental results show that the memory bandwidth of STT-MRAM reaches 26.7 GB/s, and that the digital SpMV accelerator improves energy consumption by up to 64x and latency by up to 1120x across matrices with sparsity from 10% to 99.8%. Transmission of single-precision and double-precision elements increases by up to 8x and 9.6x, respectively. Overall, the design achieves up to 15.9x the throughput of state-of-the-art designs.

The paper also outlines the development of MRAM technology, MRAM-centric computing, and research directions in system-level design using high-level synthesis tools, and it discusses the implementation details of the STT-MRAM readout scheme and the digital SpMV accelerator.
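To make the SpMV operation mentioned above concrete, here is a minimal software sketch. It uses the standard compressed sparse row (CSR) layout purely as a stand-in: the summary does not describe the paper's direct index compression format, so nothing below should be read as the authors' format or accelerator design. The function names `dense_to_csr` and `spmv_csr` are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence, Tuple


def dense_to_csr(matrix: Sequence[Sequence[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Compress a dense matrix into CSR arrays (values, column indices, row pointers).

    CSR is an assumption used for illustration; it is NOT the paper's
    direct index compression format.
    """
    values: List[float] = []
    col_idx: List[int] = []
    row_ptr: List[int] = [0]
    for row in matrix:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)   # keep only nonzero entries
                col_idx.append(j)  # remember which column each came from
        row_ptr.append(len(values))  # end of this row in the values array
    return values, col_idx, row_ptr


def spmv_csr(values: Sequence[float], col_idx: Sequence[int],
             row_ptr: Sequence[int], x: Sequence[float]) -> List[float]:
    """Compute y = A @ x, touching only the stored nonzero entries of A."""
    y: List[float] = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


if __name__ == "__main__":
    A = [[2, 0, 0],
         [0, 0, 3],
         [0, 4, 0]]
    vals, cols, ptrs = dense_to_csr(A)
    print(spmv_csr(vals, cols, ptrs, [1, 2, 3]))
```

The inner loop runs once per stored nonzero, so the number of multiply-accumulate operations scales with the nonzero count rather than with the full matrix size. This is the general reason sparse formats reduce compute and memory traffic at high sparsity; the specific gains reported in the paper depend on its own format and hardware.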

A Low-Power In-Memory Multiplication and Accumulation Array with Modified Radix-4 Input and Canonical Signed Digit Weights

An STT-MRAM Based in Memory Architecture for Low Power Integral Computing

A Multilevel Cell STT-MRAM-Based Computing In-Memory Accelerator for Binary Convolutional Neural Network

Proposal of Analog In-Memory Computing with Magnified Tunnel Magnetoresistance Ratio and Universal STT-MRAM Cell

NAND-SPIN-based processing-in-MRAM architecture for convolutional neural network acceleration

APIM: An Antiferromagnetic MRAM-Based Processing-In-Memory System for Efficient Bit-level Operations of Quantized Convolutional Neural Networks

RRAM Based Buffer Design for Energy Efficient CNN Accelerator

RAM and TCAM Designs by Using STT-MRAM

SOT-MRAM-Based Design for Energy-Efficient and Reliable Binary Neural Network Acceleration

Evaluation of STT-MRAM as a Scratchpad for Training in ML Accelerators

Area-Aware Optimization of MRAM Crossbar Array Bit-Cell for In-Memory Computing

StoX-Net: Stochastic Processing of Partial Sums for Efficient In-Memory Computing DNN Accelerators

A 28nm 8928Kb/mm²-Weight-Density Hybrid SRAM/ROM Compute-in-Memory Architecture Reducing >95% Weight Loading from DRAM

A Heterogeneous Microprocessor for Intermittent AI Inference Using Nonvolatile-SRAM-based Compute-In-Memory

A 22-nm 1-Mb 1024-b Read Data-Protected STT-MRAM Macro With Near-Memory Shift-and-Rotate Functionality and 42.6-GB/s Read Bandwidth for Security-Aware Mobile Device

A Novel Architecture of the 3D Stacked MRAM L2 Cache for CMPs

An In-Memory Computing Multiply-and-Accumulate Circuit Based on Ternary STT-MRAMs for Convolutional Neural Networks

EXTENT: Enabling Approximation-Oriented Energy Efficient STT-RAM Write Circuit

An area and energy efficient design of domain-wall memory-based deep convolutional neural networks using stochastic computing

Enabling architectural innovations using non-volatile memory