Abstract: Progress in artificial intelligence and machine learning over the past decade has been driven by the ability to train larger deep neural networks (DNNs), leading to a compute demand that far exceeds the growth in hardware performance afforded by Moore's law. Training DNNs is an extremely memory-intensive process, requiring not just the model weights but also the activations and gradients for an entire minibatch to be stored. The need to provide high-density and low-leakage on-chip memory motivates the exploration of emerging non-volatile memory for training accelerators. Spin-Transfer-Torque MRAM (STT-MRAM) offers several desirable properties for training accelerators, including 3-4x higher density than SRAM, significantly reduced leakage power, high endurance, and reasonable access time. On the other hand, MRAM write operations require high write energy and latency due to the need to ensure reliable switching. In this study, we perform a comprehensive device-to-system evaluation and co-optimization of STT-MRAM for efficient ML training accelerator design. We devise a cross-layer simulation framework to evaluate the effectiveness of STT-MRAM as a scratchpad replacing SRAM in a systolic-array-based DNN accelerator. To address the inefficiency of writes in STT-MRAM, we propose to reduce write voltage and duration. To evaluate the ensuing accuracy-efficiency trade-off, we conduct a thorough analysis of the error tolerance of input activations, weights, and errors during training. We propose heterogeneous memory configurations that enable training convergence with good accuracy. We show that MRAM provides up to a 15-22x improvement in system-level energy across a suite of DNN benchmarks under iso-capacity and iso-area scenarios. Further optimizing STT-MRAM write operations can provide over 2x improvement in write energy with minimal degradation in application-level training accuracy.
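The error-tolerance analysis mentioned in the abstract can be illustrated with a small bit-flip injection experiment. The sketch below is not the authors' framework: it assumes that relaxed write voltage or duration shows up as independent random bit flips in stored 8-bit values, and the function names and the bit error rate of 1e-2 are hypothetical choices made only for illustration.

```python
import random


def inject_write_errors(values, bit_error_rate, bits=8, seed=0):
    """Flip each stored bit independently with probability bit_error_rate.

    `values` are unsigned integers in [0, 2**bits). Independent bit flips
    are an assumed error model, not one taken from the paper.
    """
    rng = random.Random(seed)
    corrupted = []
    for v in values:
        mask = 0
        for b in range(bits):
            if rng.random() < bit_error_rate:
                mask |= 1 << b  # mark bit b as failed during the write
        corrupted.append(v ^ mask)
    return corrupted


def count_bit_flips(original, corrupted):
    """Count the bits that differ between two equal-length buffers."""
    return sum(bin(a ^ b).count("1") for a, b in zip(original, corrupted))


# A hypothetical scratchpad buffer holding 10,000 8-bit values.
data = [i % 256 for i in range(10_000)]

clean = inject_write_errors(data, bit_error_rate=0.0)
noisy = inject_write_errors(data, bit_error_rate=1e-2)

observed_ber = count_bit_flips(data, noisy) / (len(data) * 8)
print(f"observed bit error rate: {observed_ber:.4f}")
```

In a study of this kind, a routine like this would be applied separately to the activation, weight, and error buffers at different error rates to see which data types tolerate cheaper writes, which is the sort of question a heterogeneous memory configuration is meant to answer.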

Work-in-Progress: Toward Energy-efficient Near STT-MRAM Processing Architecture for Neural Networks

Toward Energy Efficient STT-MRAM-based Near Memory Computing Architecture for Embedded Systems

A 3.89-GOPS/mW Scalable Recurrent Neural Network Processor with Improved Efficiency on Memory and Computation

SOT-MRAM-Based Design for Energy-Efficient and Reliable Binary Neural Network Acceleration

NAND-SPIN-based processing-in-MRAM architecture for convolutional neural network acceleration

TIME: A Training-in-Memory Architecture for RRAM-Based Deep Neural Networks

Energy Efficient RRAM Spiking Neural Network for Real Time Classification

A Multilevel Cell STT-MRAM-Based Computing In-Memory Accelerator for Binary Convolutional Neural Network

Spiking Neural Network with RRAM: Can We Use It for Real-World Application?

HXNOR-PBNN: A Scalable and Parallel Spintronics Synaptic Architecture for Probabilistic Binary Neural Networks

An In-Memory Computing Multiply-and-accumulate Circuit Based on Ternary STT-MRAMs for Convolutional Neural Networks.

APIM: An Antiferromagnetic MRAM-Based Processing-In-Memory System for Efficient Bit-level Operations of Quantized Convolutional Neural Networks

Implementing Binarized Neural Networks with Magnetoresistive RAM without Error Correction

SNrram: an Efficient Sparse Neural Network Computation Architecture Based on Resistive Random-Access Memory.

Evaluation of STT-MRAM as a Scratchpad for Training in ML Accelerators

Energy Efficient Spiking Neural Network Design with RRAM Devices

An STT-MRAM Based in Memory Architecture for Low Power Integral Computing

RRAM-DNN: an RRAM and Model-Compression Empowered All-Weights-On-Chip DNN Accelerator

Long-Term Accuracy Enhancement of Binary Neural Networks Based on Optimized Three-Dimensional Memristor Array

FangTianSim: High-Level Cycle-Accurate Resistive Random-Access Memory-Based Multi-Core Spiking Neural Network Processor Simulator

7.5 A 65nm 0.39-to-140.3TOPS/W 1-to-12b Unified Neural Network Processor Using Block-Circulant-Enabled Transpose-Domain Acceleration with 8.1× Higher TOPS/mm² and 6T HBST-TRAM-Based 2D Data-Reuse Architecture