Abstract: In-memory computing (IMC) hardware accelerators allow more than 10x improvements in peak efficiency and performance for matrix-vector multiplications (MVM) compared to conventional digital designs. For this reason, they have gained great interest for the acceleration of neural network workloads. Nevertheless, these potential gains are only achieved when the utilization of the computational resources is maximized and the overhead of loading operands into the memory array is minimized. To this end, this paper proposes a novel mapping algorithm for the weights in the IMC macro, based on efficient packing of the weights of network layers in the available memory. The algorithm 1) minimizes weight loading times while 2) maximally exploiting the parallelism of the IMC computational fabric. A set of case studies is carried out to show achievable trade-offs for the MLPerf Tiny benchmark \cite{mlperftiny} on IMC architectures, with potential $10$-$100\times$ improvements in energy-delay product (EDP).

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve two key problems encountered when running neural network workloads on in-memory computing (IMC) accelerators:

1. **Weight loading overhead**:
   - **Energy consumption and latency**: Weight loading consumes a large amount of energy and adds latency. Each load requires fetching data from outside the IMC macro (such as DRAM), rearranging it, and writing it into the memory array.
   - **Stalls due to frequent updates**: Weight loading and computation cannot proceed in parallel in the same memory macro, which causes an inherent stall, especially when the weights need to be updated frequently.
   - **Limitations of off-chip storage**: Weights are usually stored in off-chip memory (such as DRAM), whose limited bandwidth and high access energy make stalls unavoidable.
2. **Insufficient utilization of computing resources**:
   - **Under-utilization of spatial parallelism**: When performing matrix-vector multiplication (MVM) operations, the spatial parallelism of IMC designs is not fully utilized, which hurts performance and efficiency.
   - **Limited dataflow flexibility**: Existing IMC architectures offer little dataflow flexibility, so the available computing resources cannot always be used to the full.

### Solutions

To solve the above problems, the paper proposes a new mapping algorithm that efficiently packs the weights of each neural network layer in the IMC macro. The specific objectives are:

1. **Minimize weight loading time**: Packing weights compactly reduces the need to load weights from external memory, which lowers energy consumption and latency.
2. **Maximize the parallelism of computing resources**: Optimizing the layout of weights in the IMC macro makes full use of the spatial parallelism of the IMC computing fabric and improves computing efficiency.
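To make the under-utilization problem concrete, the following minimal sketch estimates what fraction of an IMC array a single layer keeps busy. It assumes the layer's weight matrix is tiled onto an array with inputs along one dimension (`d_i`) and outputs along the other (`d_o`); the function name `spatial_utilization` and all sizes are illustrative assumptions, not values from the paper.

```python
import math


def spatial_utilization(layer_in, layer_out, d_i, d_o):
    """Fraction of array cells holding useful weights, averaged over
    all tiles needed to cover a layer_in x layer_out weight matrix.

    Assumption: the matrix is cut into tiles of at most d_i x d_o,
    and every tile occupies a full array pass, padded or not.
    """
    tiles = math.ceil(layer_in / d_i) * math.ceil(layer_out / d_o)
    return (layer_in * layer_out) / (tiles * d_i * d_o)


# A small layer leaves most of a large array idle.
small = spatial_utilization(64, 64, 256, 256)
# A layer that matches the array shape fills it completely.
exact = spatial_utilization(256, 256, 256, 256)
# A layer slightly larger than the array needs a second, mostly empty tile.
ragged = spatial_utilization(300, 256, 256, 256)
print(small, exact, ragged)
```

Under these assumptions, layers that are much smaller than the array, or that overshoot it slightly, waste most of the available parallelism, which is the gap a packing-based mapping tries to close.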
### Method overview

The solution proposed in the paper includes the following steps:

1. **Generate initial weight blocks**: Define a set of uniform weight blocks according to the dimensions of the IMC macro, and consider combining these blocks to expand the pool (the combined blocks are called super-blocks).
2. **Super-block generation**: Stack multiple original weight blocks along the time-multiplexing dimension $D_m$ to form super-blocks.
3. **Column generation**: Find a dense allocation of super-blocks in the $D_i\times D_o$ space to ensure the spatial parallelism of each network layer.
4. **Column allocation to macros**: Allocate the generated columns to different IMC macros, ensuring that each macro contains only the weights of one layer, in order to maximize computing utilization.

Through this series of steps, the paper shows how to effectively pack weights in the IMC architecture, thereby significantly reducing weight loading overhead and increasing the utilization of computing resources.

### Experimental results

The paper verifies the effectiveness of the proposed method through a series of case studies. In particular, on the MLPerf Tiny benchmark, it shows a potential 10-100-fold improvement in the energy-delay product (EDP).
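The tiling, stacking, and column-generation steps of the method overview can be illustrated with a deliberately simplified sketch. The summary does not spell out the paper's actual algorithm, so this example substitutes a one-dimensional first-fit-decreasing heuristic along $D_o$ and keeps each column to a single layer. The names `SuperBlock`, `make_super_blocks`, and `pack_columns`, and all layer shapes, are hypothetical, and hardware constraints on which blocks may sit side by side are ignored.

```python
import math
from dataclasses import dataclass


@dataclass
class SuperBlock:
    layer: str
    width: int  # output lanes occupied along D_o
    depth: int  # tiles stacked along the time-multiplexing dimension D_m


def make_super_blocks(layers, d_i, d_o, d_m):
    """Cut each (rows, cols) weight matrix into tiles of at most
    d_i x d_o, then stack up to d_m tiles of one layer per super-block."""
    blocks = []
    for name, (rows, cols) in layers.items():
        n_tiles = math.ceil(rows / d_i) * math.ceil(cols / d_o)
        # Conservative width: a ragged last tile is treated as full width.
        width = min(cols, d_o)
        while n_tiles > 0:
            depth = min(n_tiles, d_m)
            blocks.append(SuperBlock(name, width, depth))
            n_tiles -= depth
    return blocks


def pack_columns(blocks, d_o):
    """First-fit decreasing: place each super-block in the first column
    of the same layer that still has room along D_o."""
    columns = []
    for block in sorted(blocks, key=lambda b: b.width, reverse=True):
        for column in columns:
            same_layer = column[0].layer == block.layer
            used = sum(b.width for b in column)
            if same_layer and used + block.width <= d_o:
                column.append(block)
                break
        else:
            # No existing column fits, so open a new one.
            columns.append([block])
    return columns


# Hypothetical layer shapes given as (input size, output size).
layers = {"conv": (1024, 16), "fc": (64, 64)}
blocks = make_super_blocks(layers, d_i=128, d_o=64, d_m=2)
columns = pack_columns(blocks, d_o=64)
for index, column in enumerate(columns):
    print(index, [(b.layer, b.width, b.depth) for b in column])
```

In this toy setup the narrow `conv` super-blocks end up side by side in one column instead of each occupying its own mostly empty array pass, which is the kind of dense allocation step 3 aims for. Assigning columns to macros (step 4) is left out.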

Pack my weights and run! Minimizing overheads for in-memory computing accelerators
