Abstract:We propose a co-design approach for compute-in-memory inference for deep neural networks (DNN). We use multiplication-free function approximators based on $ell _{1}$ norm along with a co-adapted processing array and compute flow. Using the approach, we overcame many deficiencies in the current art of in-SRAM DNN processing such as the need for digital-to-analog converters (DACs) at each operating SRAM row/column, the need for high precision analog-to-digital converters (ADCs), limited support for multi-bit precision weights, and limited vector-scale parallelism. Our co-adapted implementation seamlessly extends to multi-bit precision weights, it doesn't require DACs, and it easily extends to higher vector-scale parallelism. We also propose an SRAM-immersed successive approximation ADC (SA-ADC), where we exploit the parasitic capacitance of bit lines of SRAM array as a capacitive DAC. Since the dominant area overhead in SA-ADC comes due to its capacitive DAC, by exploiting the intrinsic parasitic of SRAM array, our approach allows low area implementation of within-SRAM SA-ADC. Our $8times 62$ SRAM macro, which requires a 5-bit ADC, achieves ~105 tera operations per second per Watt (TOPS/W) with 8-bit input/weight processing at 45 nm CMOS. Our $8times 30$ SRAM macro, which requires a 4-bit ADC, achieves ~84 TOPS/W. SRAM macros that require lower ADC precision are more tolerant of process variability, however, have lower TOPS/W as well. We evaluated the accuracy and performance of our proposed network for MNIST, CIFAR10, and CIFAR100 datasets. We chose a network configuration which adaptively mixes multiplication-free and regular operators. The network configura-ions utilize the multiplication-free operator for more than 85% operations from the total. The selected configurations are 98.6% accurate for MNIST, 90.2% for CIFAR10, and 66.9% for CIFAR100. Since most of the operations in the considered configurations are based on proposed SRAM macros, our compute-in-memory's efficiency benefits broadly translate to the system-level.

dCSR: A Memory-Efficient Sparse Matrix Representation for Parallel Neural Network Inference

A Low-Power In-Memory Multiplication and Accumulation Array with Modified Radix-4 Input and Canonical Signed Digit Weights

Diagonally-Addressed Matrix Nicknack: How to improve SpMV performance

Efficient Sparse Processing-in-Memory Architecture (ESPIM) for Machine Learning Inference

Containing Analog Data Deluge at Edge through Frequency-Domain Compression in Collaborative Compute-in-Memory Networks

SP-IMC: A Sparsity Aware In-Memory-Computing Macro in 28nm CMOS with Configurable Sparse Representation for Highly Sparse DNN Workloads

A Digital SRAM Computing-in-Memory Design Utilizing Activation Unstructured Sparsity for High-Efficient DNN Inference

Twofold Sparsity: Joint Bit- and Network-Level Sparsity for Energy-Efficient Deep Neural Network Using RRAM Based Compute-In-Memory

A 65 Nm 73 Kb SRAM-Based Computing-In-Memory Macro with Dynamic-Sparsity Controlling

Near-Memory Parallel Indexing and Coalescing: Enabling Highly Efficient Indirect Access for SpMV

Neural networks on microcontrollers: saving memory at inference via operator reordering

vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs

SparCE: Sparsity aware General Purpose Core Extensions to Accelerate Deep Neural Networks

EAST: Encoding-Aware Sparse Training for Deep Memory Compression of ConvNets

Efficient Neural Network Deployment for Microcontroller

Eidetic: An In-Memory Matrix Multiplication Accelerator for Neural Networks

IndexMAC: A Custom RISC-V Vector Instruction to Accelerate Structured-Sparse Matrix Multiplications

Energy-efficient SNN Architecture using 3nm FinFET Multiport SRAM-based CIM with Online Learning

Incremental Training and Group Convolution Pruning for Runtime DNN Performance Scaling on Heterogeneous Embedded Platforms

MF-Net: Compute-In-Memory SRAM for Multibit Precision Inference Using Memory-Immersed Data Conversion and Multiplication-Free Operators

SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers