Abstract: We propose a co-design approach for compute-in-memory inference for deep neural networks (DNNs). We use multiplication-free function approximators based on the $\ell_{1}$ norm along with a co-adapted processing array and compute flow. Using this approach, we overcome many deficiencies in the current art of in-SRAM DNN processing, such as the need for digital-to-analog converters (DACs) at each operating SRAM row/column, the need for high-precision analog-to-digital converters (ADCs), limited support for multi-bit precision weights, and limited vector-scale parallelism. Our co-adapted implementation seamlessly extends to multi-bit precision weights, does not require DACs, and easily extends to higher vector-scale parallelism. We also propose an SRAM-immersed successive approximation ADC (SA-ADC), where we exploit the parasitic capacitance of the bit lines of the SRAM array as a capacitive DAC. Since the dominant area overhead in an SA-ADC comes from its capacitive DAC, exploiting the intrinsic parasitics of the SRAM array allows a low-area implementation of a within-SRAM SA-ADC. Our $8\times 62$ SRAM macro, which requires a 5-bit ADC, achieves ~105 tera operations per second per Watt (TOPS/W) with 8-bit input/weight processing at 45 nm CMOS. Our $8\times 30$ SRAM macro, which requires a 4-bit ADC, achieves ~84 TOPS/W. SRAM macros that require lower ADC precision are more tolerant of process variability but also have lower TOPS/W. We evaluated the accuracy and performance of our proposed network on the MNIST, CIFAR10, and CIFAR100 datasets. We chose network configurations that adaptively mix multiplication-free and regular operators. These configurations use the multiplication-free operator for more than 85% of all operations. The selected configurations are 98.6% accurate for MNIST, 90.2% for CIFAR10, and 66.9% for CIFAR100. Since most of the operations in the considered configurations are based on the proposed SRAM macros, the efficiency benefits of our compute-in-memory approach broadly translate to the system level.
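The abstract does not spell out the multiplication-free operator, so the following is only a minimal sketch under an assumption: it uses the form $a \oplus b = \mathrm{sign}(a b)\,(|a| + |b|)$ summed over vector elements, which appears in the multiplication-free network literature and induces the $\ell_{1}$ norm, since $x \oplus x = 2\lVert x \rVert_{1}$. The function name `mf_dot` is hypothetical, and the exact operator used in the paper may differ.

```python
import numpy as np

def mf_dot(x, w):
    """Multiplication-free correlation of two integer vectors.

    ASSUMED operator form (not given in the abstract):
        sum_i sign(x_i * w_i) * (|x_i| + |w_i|)
    Only sign comparisons, absolute values, and additions are used.
    """
    x = np.asarray(x, dtype=np.int64)
    w = np.asarray(w, dtype=np.int64)

    # Element-wise magnitude of each term: an addition, not a product.
    magnitude = np.abs(x) + np.abs(w)

    # Sign of the product is positive when both operands share a sign.
    same_sign = (x > 0) == (w > 0)
    # A term contributes nothing if either operand is zero.
    nonzero = (x != 0) & (w != 0)

    signed = np.where(same_sign, magnitude, -magnitude)
    return int(np.sum(np.where(nonzero, signed, 0)))

x = [1, -2, 3]
w = [4, 5, -6]
print(mf_dot(x, w))  # operator applied to two different vectors
print(mf_dot(x, x))  # equals twice the l1 norm of x
```

Under this assumed form, correlating a vector with itself returns twice its $\ell_{1}$ norm, which is the sense in which the operator is "based on" that norm.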

Assessment of inference accuracy and memory capacity of computation-in-memory enabled neural network due to quantized weights, gradients, input and output signals, and memory non-idealities

PIM-QAT: Neural Network Quantization for Processing-In-Memory (PIM) Systems

Efficient Neural Compression with Inference-time Decoding

QuantBayes: Weight Optimization for Memristive Neural Networks via Quantization-Aware Bayesian Inference

Enhancing in-situ updates of quantized memristor neural networks: a Siamese network learning approach

Just-in-time Quantization with Processing-In-Memory for Efficient ML Training

CIMQ: A Hardware-Efficient Quantization Framework for Computing-In-Memory Based Neural Network Accelerators

Improving the Accuracy of Neural Networks in Analog Computing-in-memory Systems by a Generalized Quantization Method

Accelerating Neural Network Inference by Overflow Aware Quantization

MF-Net: Compute-In-Memory SRAM for Multibit Precision Inference Using Memory-Immersed Data Conversion and Multiplication-Free Operators

Towards Efficient In-memory Computing Hardware for Quantized Neural Networks: State-of-the-art, Open Challenges and Perspectives

Mixed Precision Quantization for ReRAM-based DNN Inference Accelerators

Low Quantization Error Readout Circuit with Fully Charge-Domain Calculation for Computation-in-Memory Deep Neural Network

Quantized Memory-Augmented Neural Networks

NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks

An In-Memory-Computing Structure with Quantum-Dot Transistor Toward Neural Network Applications: From Analog Circuits to Memory Arrays

Improving the accuracy of neural networks in analog computing-in-memory systems by analog weight

Memristor Based Mixed-Accuracy Computation-in-Memory System

Device Variation Effects on Neural Network Inference Accuracy in Analog In‐Memory Computing Systems

A Hybrid RRAM-SRAM Computing-In-Memory Architecture for Deep Neural Network Inference-Training Edge Acceleration

Memory Faults in Activation-sparse Quantized Deep Neural Networks: Analysis and Mitigation using Sharpness-aware Training