Abstract: We introduce $\textit{sorted weight sectioning}$ (SWS): a weight allocation algorithm that places sorted deep neural network (DNN) weight sections on bit-sliced compute-in-memory (CIM) crossbars to reduce analog-to-digital converter (ADC) energy consumption. Data conversions are the most energy-intensive process in crossbar operation. SWS effectively reduces this cost by leveraging (1) small weights and (2) zero weights (weight sparsity). DNN weights follow bell-shaped distributions, with most weights near zero. Using SWS, we only need low-order crossbar columns for sections with low-magnitude weights. This reduces the quantity and resolution of ADCs used, exponentially decreasing ADC energy costs without significantly degrading DNN accuracy. Unstructured sparsification further sharpens the weight distribution with small accuracy loss. However, it presents challenges in hardware tracking of zeros: we cannot switch zero rows to other layer weights in unsorted crossbars without index matching. SWS efficiently addresses unstructured sparse models using offline remapping of zeros into earlier sections, which reveals full sparsity potential and maximizes energy efficiency. Our method reduces ADC energy use by 89.5% on unstructured sparse BERT models. Overall, this paper introduces a novel algorithm to promote energy-efficient CIM crossbars for unstructured sparse DNN workloads.

What problem does this paper attempt to address?

The main problem this paper addresses is reducing the energy consumption of analog-to-digital converters (ADCs) in compute-in-memory (CIM) crossbars when processing unstructured sparse deep neural network (DNN) workloads. Specifically, the paper proposes a new weight allocation algorithm, Sorted Weight Sectioning (SWS), to optimize the mapping of unstructured sparse DNNs onto CIM crossbars.

### Main Problems and Solutions:

1. **High Energy Consumption Problem**:
   - **Background**: When CIM crossbars perform DNN acceleration tasks, ADCs account for about 85% of the total energy, making them the main bottleneck for energy efficiency.
   - **Challenge**: How to reduce ADC energy consumption by exploiting the sparsity and weight distribution of DNNs while maintaining model accuracy.
2. **Hardware Tracking Difficulties Caused by Unstructured Sparsity**:
   - **Background**: Although unstructured pruning increases sparsity with only a small accuracy loss, zero weights are difficult to track efficiently in hardware.
   - **Challenge**: How to handle zero weights in unstructured sparse models efficiently without significantly affecting model performance.

### Solution Overview:

- **Sorted Weight Sectioning (SWS)**:
  - **Principle**: DNN weights are sorted by magnitude and mapped onto CIM crossbars in sections, so that sections of low-magnitude weights need only the low-order crossbar columns. This reduces the number and resolution of the ADCs required.
  - **Steps**:
    1. Sort the weight vector by weight magnitude.
    2. Divide the sorted weight vector into multiple sections, with each section corresponding to a row of weight values.
    3. Program each section into the CIM crossbar.
    4. Reorder the activation vector to match the sorted weight order so that the dot-product operation remains correct.
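The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names `sorted_weight_sections` and `bits_needed`, the section size of 16, and the 8-bit fixed-point quantization are all hypothetical choices made for the example. It shows that permuting the activations with the same sort order preserves the dot product, and that sections holding smaller weights need fewer bits.

```python
import numpy as np

def sorted_weight_sections(weights, section_size):
    """Steps 1-2: sort weights by magnitude and split them into sections.

    Hypothetical helper; returns the sort permutation and the sections.
    """
    order = np.argsort(np.abs(weights), kind="stable")  # smallest magnitude first
    sorted_w = weights[order]
    sections = [sorted_w[i:i + section_size]
                for i in range(0, len(sorted_w), section_size)]
    return order, sections

def bits_needed(section, frac_bits=8):
    """Bits needed for the largest magnitude in a section.

    Assumes a simple fixed-point code with `frac_bits` fractional bits.
    """
    q = np.round(np.abs(section) * (1 << frac_bits)).astype(np.int64)
    return int(q.max()).bit_length()

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=64)   # bell-shaped weights (assumed sigma)
x = rng.normal(0.0, 1.0, size=64)   # activations

section_size = 16
order, sections = sorted_weight_sections(w, section_size)

# Step 4: reorder activations with the same permutation as the weights.
x_perm = x[order]

# Step 3 stand-in: each section contributes one partial dot product.
partials = [float(s @ x_perm[i * section_size:(i + 1) * section_size])
            for i, s in enumerate(sections)]

# The partial sums reproduce the original dot product.
assert np.isclose(sum(partials), float(w @ x))

# Earlier sections hold smaller weights, so they never need more bits.
section_bits = [bits_needed(s) for s in sections]
print(section_bits)
```

Because the sections are cut from a magnitude-sorted vector, the per-section bit widths are non-decreasing, which is the property that lets low-magnitude sections use only low-order columns.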
### Formula Explanation:

Suppose $W$ is a random variable following a normal distribution $N(0, \sigma)$ (an approximate distribution of pre-trained weights). Define the symmetric regions:

\[ S_k = (-w_k, -w_{k - 1}) \cup (w_{k - 1}, w_k) \]

\[ S_{k + 1} = (-w_{k + 1}, -w_k) \cup (w_k, w_{k + 1}) \]

where $0 < w_{k - 1} < w_k < w_{k + 1}$ (sorted-weight assumption). For $w \in S_k$ and $w' \in S_{k + 1}$, their binary representations are:

\[ w = \sum_{i = 0}^{b - 1} a_i 2^{-i} \]

\[ w' = \sum_{i = 0}^{b - 1} a'_i 2^{-i} \]

where $a_i, a'_i \in \{0, 1\}$. The goal is to prove that for any $i$ ($0 \leq i \leq b - 1$):

\[ P(a_i = 0) > P(a'_i = 0) \]

Since $S_k$ is closer to zero and the normal density is symmetric and decreases away from zero, the probability that $a_i = 0$ in $S_k$ is greater than the probability that $a'_i = 0$ in $S_{k + 1}$.

### Experimental Results:

Experiments show that SWS significantly reduces ADC energy consumption without significantly reducing model accuracy. For example, on the ImageNet-1K dataset, SWS reduces ADC energy consumption by 75.70%, while accuracy drops by only 0.07% at a fixed resolution.

### Summary:

This paper addresses the high energy consumption of CIM crossbars when processing unstructured sparse DNNs by introducing the SWS algorithm, which significantly improves energy efficiency and offers a new research direction for future CIM-based DNN implementations.
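The bit-probability claim in the Formula Explanation can be sanity-checked by sampling. The sketch below is an empirical illustration, not a proof, and every concrete value in it is an assumption: $\sigma = 0.1$, $b = 8$ bits, truncation to fixed point, and the region boundaries $0.0625$ and $0.125$ standing in for $w_k$ and $w_{k+1}$ (with $w_{k-1}$ taken as $0$). The helper `zero_bit_fraction` is hypothetical. Note that high-order bits that are zero in both regions tie, so the strict inequality is not visible at every position; the clearest gap appears at the bit that separates the two regions.

```python
import numpy as np

def zero_bit_fraction(magnitudes, bits=8):
    """Fraction of zeros at each position i of |w| = sum_i a_i 2^{-i}.

    Hypothetical helper: truncates |w| to `bits` fixed-point bits,
    where i = 0 is the most significant bit (weight 2^0).
    """
    q = np.floor(magnitudes * (1 << (bits - 1))).astype(np.int64)
    # Row i of `a` holds bit a_i for every sample.
    a = np.array([(q >> (bits - 1 - i)) & 1 for i in range(bits)])
    return 1.0 - a.mean(axis=1)

rng = np.random.default_rng(0)
mag = np.abs(rng.normal(0.0, 0.1, size=100_000))  # |W| with assumed sigma = 0.1

inner = mag[mag < 0.0625]                         # stands in for S_k
outer = mag[(mag > 0.0625) & (mag < 0.125)]       # stands in for S_{k+1}

p_inner = zero_bit_fraction(inner)  # estimates P(a_i = 0) in S_k
p_outer = zero_bit_fraction(outer)  # estimates P(a'_i = 0) in S_{k+1}

print(np.round(p_inner, 3))
print(np.round(p_outer, 3))
print(p_inner.mean() > p_outer.mean())
```

With these boundaries, the bit of weight $2^{-4}$ is always zero in the inner region and always one in the outer region, so the average fraction of zero bits is higher for the region closer to zero.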

Sorted Weight Sectioning for Energy-Efficient Unstructured Sparse DNNs on Compute-in-Memory Crossbars
