Abstract:In response to the escalating demand for hardware-efficient Deep Neural Network (DNN) architectures, we present a novel quantize-enabled multiply-accumulate (MAC) unit. Our methodology employs a right shift-and-add computation for MAC operation, enabling runtime truncation without additional hardware. This architecture optimally utilizes hardware resources, enhancing throughput performance while reducing computational complexity through bit-truncation techniques. Our key methodology involves designing a hardware-efficient MAC computational algorithm that supports both iterative and pipeline implementations, catering to diverse hardware efficiency or enhanced throughput requirements in accelerators. Additionally, we introduce a processing element (PE) with a pre-loading bias scheme, reducing one clock delay and eliminating the need for conventional extra resources in PE implementation. The PE facilitates quantization-based MAC calculations through an efficient bit-truncation method, removing the necessity for extra hardware logic. This versatile PE accommodates variable bit-precision with a dynamic fraction part within the sfxpt< N,f representation, meeting specific model or layer demands. Through software emulation, our proposed approach demonstrates minimal accuracy loss, revealing under 1.6% loss for LeNet-5 using MNIST and around 4% for ResNet-18 and VGG-16 with CIFAR-10 in the sfxpt< 8 ,5 format compared to conventional float32-based implementations. Hardware performance parameters on the Xilinx-Virtex-7 board unveil a 37% reduction in area utilization and a 45% reduction in power consumption compared to the best state-of-the-art MAC architecture. Extending the proposed MAC to a LeNet DNN model results in a 42% reduction in resource requirements and a significant 27% reduction in delay. This architecture provides notable advantages for resource-efficient, high-throughput edge-AI applications.

PhaseMAC: A 14 TOPS/W 8bit GRO based Phase Domain MAC Circuit for In-Sensor-Computed Deep Learning Accelerators

A Robust 8-Bit Non-Volatile Computing-in-Memory Core for Low-Power Parallel MAC Operations.

A Low-Power In-Memory Multiplication and Accumulation Array with Modified Radix-4 Input and Canonical Signed Digit Weights

A 4-Bit Mixed-Signal MAC Macro with One-Shot ADC Conversion.

TMA: Tera-MACs/W Neural Hardware Inference Accelerator with a Multiplier-less Massive Parallel Processor

A Brain-Inspired ADC-Free SRAM-Based In-Memory Computing Macro with High-Precision MAC for AI Application

QuantMAC: Enhancing Hardware Performance in DNNs With Quantize Enabled Multiply-Accumulate Unit

PS-IMC: A 2385.7 TOPS/W/b Precision Scalable In-Memory Computing Macro with Bit-Parallel Inputs and Decomposable Weights for DNNs

Audio and Image Cross-Modal Intelligence Via a 10TOPS/W 22nm SoC with Back-Propagation and Dynamic Power Gating

A Low-Power Charge-Domain Bit-Scalable Readout System for Fully-Parallel Computing-in-Memory Accelerators

A 177 TOPS/W, Capacitor-based In-Memory Computing SRAM Macro with Stepwise-Charging/Discharging DACs and Sparsity-Optimized Bitcells for 4-Bit Deep Convolutional Neural Networks.

A Communication-Aware DNN Accelerator on ImageNet Using In-Memory Entry-Counting Based Algorithm-Circuit-Architecture Co-Design in 65-nm CMOS

A Charge-Domain Scalable-Weight In-Memory Computing Macro With Dual-SRAM Architecture for Precision-Scalable DNN Accelerators

A 768.7-2124.2 TOPS/W Time-Domain Computing-in-Memory Macro with Low Static Leakage and Precision-Configurable TDC

An 11T1C Bit-Level-Sparsity-Aware Computing-in-Memory Macro with Adaptive Conversion Time and Computation Voltage

An SRAM Compute-In-Memory Macro Based on Direct Coupling SAR ADC and DAC Reuse

EncodingNet: A Novel Encoding-based MAC Design for Efficient Neural Network Acceleration

A 16-Channel Fully Configurable Neural SoC With 1.52 μW/Ch Signal Acquisition, 2.79 μW/Ch Real-Time Spike Classifier, and 1.79 TOPS/W Deep Neural Network Accelerator in 22 nm FDSOI

7.8 A 22nm Delta-Sigma Computing-In-Memory (Δ∑CIM) SRAM Macro with Near-Zero-Mean Outputs and LSB-First ADCs Achieving 21.38TOPS/W for 8b-MAC Edge AI Processing

MANTIS: A Mixed-Signal Near-Sensor Convolutional Imager SoC Using Charge-Domain 4b-Weighted 5-to-84-TOPS/W MAC Operations for Feature Extraction and Region-of-Interest Detection

MAC-DO: An Efficient Output-Stationary GEMM Accelerator for CNNs Using DRAM Technology