Abstract: Progress in artificial intelligence and machine learning over the past decade has been driven by the ability to train larger deep neural networks (DNNs), leading to a compute demand that far exceeds the growth in hardware performance afforded by Moore's law. Training DNNs is an extremely memory-intensive process, requiring not just the model weights but also activations and gradients for an entire minibatch to be stored. The need to provide high-density and low-leakage on-chip memory motivates the exploration of emerging non-volatile memory for training accelerators. Spin-Transfer-Torque MRAM (STT-MRAM) offers several desirable properties for training accelerators, including 3-4x higher density than SRAM, significantly reduced leakage power, high endurance and reasonable access time. On the other hand, MRAM write operations require high write energy and latency due to the need to ensure reliable switching. In this study, we perform a comprehensive device-to-system evaluation and co-optimization of STT-MRAM for efficient ML training accelerator design. We devise a cross-layer simulation framework to evaluate the effectiveness of STT-MRAM as a scratchpad replacing SRAM in a systolic-array-based DNN accelerator. To address the inefficiency of writes in STT-MRAM, we propose to reduce write voltage and duration. To evaluate the ensuing accuracy-efficiency trade-off, we conduct a thorough analysis of the error tolerance of input activations, weights, and errors during training. We propose heterogeneous memory configurations that enable training convergence with good accuracy. We show that STT-MRAM provides up to 15-22x improvement in system-level energy across a suite of DNN benchmarks under iso-capacity and iso-area scenarios. Further optimizing STT-MRAM write operations can provide over 2x improvement in write energy for minimal degradation in application-level training accuracy.

What problem does this paper attempt to address?

The paper evaluates Spin-Transfer-Torque Magnetic Random Access Memory (STT-MRAM) as an on-chip scratchpad for machine learning accelerators during training. It aims to address the following key issues:

1. **Alleviating the Memory Wall Bottleneck**: As Deep Neural Networks (DNNs) grow, the compute demand of training far exceeds the hardware performance improvements afforded by Moore's Law. Training is also memory-intensive: model weights, activations, and gradients for an entire minibatch must be stored, and they do not all fit on-chip. The resulting off-chip memory accesses are costly and limit training speed and energy efficiency.
2. **Challenges of Replacing SRAM**: Current training accelerators use large Static Random Access Memory (SRAM) arrays as scratchpads, but SRAM's high leakage power and limited density scaling prompt researchers to explore denser non-volatile memory technologies. STT-MRAM, with its high endurance and reasonable access time, is seen as a potential alternative.
3. **Write Efficiency Issues of STT-MRAM**: Although STT-MRAM offers 3-4x higher density and significantly lower leakage power than SRAM, its write operations cost more energy and time, because a high write voltage and a long write pulse are needed to ensure reliable magnetization switching.

The main contributions of the paper include:

1. **Cross-Layer Design Space Exploration Framework**: A device-to-system design space exploration framework evaluates the effectiveness of STT-MRAM as a scratchpad in a systolic-array-based DNN training accelerator and compares it with SRAM, studying potential energy efficiency improvements in iso-capacity and iso-area scenarios.
2. **Optimized Write Operations**: The paper proposes to address the STT-MRAM write bottleneck by reducing the write voltage and shortening the write duration, and assesses the impact of these low-energy writes on training energy efficiency and on final DNN model accuracy. Based on the error tolerance of input activations, weights, and errors during training, it proposes a heterogeneous memory architecture in which different parts of a number (such as the exponent and the mantissa) are mapped to STT-MRAM arrays with different write error rates.
3. **Energy Efficiency Improvements**: Replacing SRAM with STT-MRAM as the on-chip scratchpad improves system-level energy by up to 15x and 22x in the iso-capacity and iso-area scenarios, respectively. Further optimizing STT-MRAM write operations improves write energy by more than 2x, with only minimal degradation in application-level training accuracy.

Through this work, the paper attempts to overcome the challenges of using STT-MRAM in deep learning training accelerators, especially the inefficiency of write operations, and thereby enable more energy-efficient machine learning hardware.
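The idea of mapping different bit fields of a number to arrays with different write error rates can be illustrated with a small fault-injection sketch. This is not the paper's simulation framework: the function names `inject_write_errors` and `large_error_fraction`, the choice of IEEE 754 float32 as the number format, and the bit error rates are all assumptions made for illustration. The sketch flips each stored bit independently with a probability that depends on whether the bit belongs to the sign/exponent field or to the mantissa.

```python
import numpy as np

def inject_write_errors(values, sign_exp_ber, mantissa_ber, rng):
    """Return a copy of `values` (as float32) with random bit flips.

    Hypothetical model: the sign bit and 8 exponent bits are written to one
    array with per-bit error rate `sign_exp_ber`; the 23 mantissa bits are
    written to another array with per-bit error rate `mantissa_ber`.
    """
    values = np.ascontiguousarray(values, dtype=np.float32)
    bits = values.view(np.uint32)

    # Per-bit flip probability, index 0 = least significant mantissa bit.
    ber = np.empty(32)
    ber[:23] = mantissa_ber
    ber[23:] = sign_exp_ber

    # Draw an independent flip decision for every bit of every value.
    flips = rng.random(bits.shape + (32,)) < ber
    bit_values = np.uint32(1) << np.arange(32, dtype=np.uint32)
    mask = np.bitwise_or.reduce(np.where(flips, bit_values, np.uint32(0)), axis=-1)

    return (bits ^ mask).view(np.float32)

def large_error_fraction(ref, out):
    """Fraction of elements that became non-finite or changed by more than |ref|."""
    with np.errstate(invalid="ignore", over="ignore"):
        bad = ~np.isfinite(out) | (np.abs(out - ref) > np.abs(ref))
    return float(np.mean(bad))

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)

# Illustrative error rates only; they are not taken from the paper.
uniform = inject_write_errors(w, sign_exp_ber=1e-3, mantissa_ber=1e-3, rng=rng)
hetero = inject_write_errors(w, sign_exp_ber=0.0, mantissa_ber=1e-3, rng=rng)

print("uniform array, large-error fraction:", large_error_fraction(w, uniform))
print("heterogeneous, large-error fraction:", large_error_fraction(w, hetero))
```

A mantissa-only flip leaves the sign and exponent of a normal float32 intact, so the corrupted value stays within a factor of two of the original, whereas a sign or exponent flip can change the magnitude by many orders. That asymmetry is the intuition behind protecting the sign and exponent with a more reliable (higher-energy) write while relaxing the write for the mantissa.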

Evaluation of STT-MRAM as a Scratchpad for Training in ML Accelerators
