Pack my weights and run! Minimizing overheads for in-memory computing accelerators

Pouya Houshmand, Marian Verhelst
2024-09-16
Abstract: In-memory computing hardware accelerators allow more than 10x improvements in peak efficiency and performance for matrix-vector multiplications (MVM) compared to conventional digital designs. For this, they have gained great interest for the acceleration of neural network workloads. Nevertheless, these potential gains are only achieved when the utilization of the computational resources is maximized and the overhead from loading operands in the memory array minimized. To this aim, this paper proposes a novel mapping algorithm for the weights in the IMC macro, based on efficient packing of the weights of network layers in the available memory. The algorithm realizes 1) minimization of weight loading times while at the same time 2) maximally exploiting the parallelism of the IMC computational fabric. A set of case studies are carried out to show achievable trade-offs for the MLPerf Tiny benchmark \cite{mlperftiny} on IMC architectures, with potential $10$-$100\times$ EDP improvements.
Hardware Architecture, Image and Video Processing
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses two key problems that arise when running neural network workloads on in-memory computing (IMC) accelerators:

1. **Weight loading overhead**
   - **Energy consumption and latency**: Weight loading costs both energy and time. Each load requires fetching data from outside the IMC macro (for example, from DRAM), rearranging it, and writing it into the memory array.
   - **Stalls due to frequent updates**: Weight loading and computation cannot proceed in parallel in the same memory macro, which causes an inherent stall, especially when the weights must be updated frequently.
   - **Limitations of off-chip storage**: Weights are usually stored in off-chip memory (such as DRAM), whose limited bandwidth and high access energy make stalls unavoidable.
2. **Insufficient utilization of computing resources**
   - **Under-utilization of spatial parallelism**: During matrix-vector multiplication (MVM), the spatial parallelism of IMC designs is not fully used, which hurts performance and efficiency.
   - **Limited dataflow flexibility**: Existing IMC architectures offer little flexibility in dataflow, so the available computing resources cannot be used to the fullest.

### Solutions

To solve these problems, the paper proposes a new mapping algorithm that efficiently packs the weights of each neural network layer into the IMC macro. It has two objectives:

1. **Minimize weight loading time**: Packing weights compactly reduces how often weights must be loaded from external memory, which lowers energy consumption and latency.
2. **Maximize the parallelism of computing resources**: Optimizing the layout of weights in the IMC macro makes full use of the spatial parallelism of the IMC computational fabric and improves computing efficiency.
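The two problems above can be made concrete with a toy cost model. This is only a sketch under simple assumptions, not the paper's model: the function names, the cycle counts, and the tiling scheme are hypothetical. It shows that a layer smaller than the macro leaves cells idle, and that load cycles add directly to compute cycles because the two cannot overlap in the same macro.

```python
from math import ceil


def spatial_utilization(layer_rows, layer_cols, macro_rows, macro_cols):
    """Fraction of macro cells holding useful weights when a
    layer_rows x layer_cols weight matrix is tiled onto the macro.
    Assumption: each tile occupies a whole macro, padded if needed."""
    tiles = ceil(layer_rows / macro_rows) * ceil(layer_cols / macro_cols)
    return (layer_rows * layer_cols) / (tiles * macro_rows * macro_cols)


def layer_latency(compute_cycles, load_cycles, reloads):
    """Total cycles for one layer. Loading and computing share the
    macro and cannot overlap, so every reload is a pure stall."""
    return compute_cycles + reloads * load_cycles


# A 64x32 layer on a 128x128 macro uses one eighth of the cells.
print(spatial_utilization(64, 32, 128, 128))
# Three reloads of 200 cycles each add 600 stall cycles.
print(layer_latency(1000, 200, 3))
```

In this model, packing more weights into the macro lowers `reloads`, and placing weights so that more cells are active raises the utilization figure. These are the two quantities the mapping algorithm targets.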
### Method overview

The proposed solution consists of the following steps:

1. **Generate initial weight blocks**: Define a set of uniform weight blocks according to the dimensions of the IMC macro, and expand the pool by combining these blocks into larger units called super-blocks.
2. **Super-block generation**: Stack multiple original weight blocks along the time-multiplexing dimension \(D_m\) to form super-blocks.
3. **Column generation**: Find a dense allocation of super-blocks in the \(D_i\times D_o\) space that preserves the spatial parallelism of each network layer.
4. **Column allocation to macros**: Allocate the generated columns to the IMC macros, ensuring that each macro contains only the weights of one layer, in order to maximize computing utilization.

Through these steps, the paper shows how to pack weights effectively in the IMC architecture, significantly reducing weight loading overhead and increasing the utilization of computing resources.

### Experimental results

The paper evaluates the proposed method through a series of case studies. On the MLPerf Tiny benchmark, it reports a potential 10 to 100-fold improvement in the energy-delay product (EDP).
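The four steps of the method overview can be illustrated with a small packing sketch. It is a simplified stand-in, not the paper's algorithm: it uses a plain first-fit-decreasing heuristic, places super-blocks side by side along \(D_o\) only, and does not enforce the per-macro layer constraint of step 4. All names (`SuperBlock`, `super_blocks`, `build_columns`, `allocate`) and the example layer sizes are hypothetical.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class SuperBlock:
    layer: str
    d_i: int  # rows used along D_i
    d_o: int  # columns used along D_o
    d_m: int  # blocks stacked along the time-multiplexing dimension D_m


def super_blocks(layer, rows, cols, macro_di, macro_do, stack):
    """Steps 1-2: tile a rows x cols weight matrix into uniform blocks,
    then stack up to `stack` blocks along D_m per super-block."""
    b_i, b_o = min(rows, macro_di), min(cols, macro_do)
    n_blocks = ceil(rows / b_i) * ceil(cols / b_o)
    out = []
    while n_blocks > 0:
        height = min(stack, n_blocks)
        out.append(SuperBlock(layer, b_i, b_o, height))
        n_blocks -= height
    return out


def build_columns(blocks, macro_di, macro_do):
    """Step 3 (simplified): place super-blocks side by side along D_o,
    first-fit decreasing on (depth, width)."""
    columns = []
    for blk in sorted(blocks, key=lambda b: (b.d_m, b.d_o), reverse=True):
        assert blk.d_i <= macro_di and blk.d_o <= macro_do
        for col in columns:
            if sum(b.d_o for b in col) + blk.d_o <= macro_do:
                col.append(blk)
                break
        else:
            columns.append([blk])
    return columns


def column_depth(col):
    """A column is as deep as its deepest super-block."""
    return max(b.d_m for b in col)


def allocate(columns, macro_dm):
    """Step 4 (simplified): stack columns along D_m inside macros,
    first-fit decreasing on depth; no per-layer constraint enforced."""
    macros = []
    for col in sorted(columns, key=column_depth, reverse=True):
        depth = column_depth(col)
        assert depth <= macro_dm
        for macro in macros:
            if sum(column_depth(c) for c in macro) + depth <= macro_dm:
                macro.append(col)
                break
        else:
            macros.append([col])
    return macros


# Hypothetical layers on a macro with D_i = D_o = 64.
blocks = (super_blocks("fc1", 64, 32, 64, 64, stack=2)
          + super_blocks("fc2", 128, 64, 64, 64, stack=2)
          + super_blocks("fc3", 32, 32, 64, 64, stack=2))
columns = build_columns(blocks, 64, 64)
print(len(columns), len(allocate(columns, macro_dm=4)))
```

With these example sizes, the two narrow layers share one column while the wide layer fills its own, and both columns fit in a single macro of depth 4. A shallower macro (`macro_dm=2`) forces a second macro, which corresponds to the case where weights can no longer all stay resident.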