Abstract: In response to the growing demand for hardware-efficient deep neural network (DNN) architectures, we present a novel quantize-enabled multiply-accumulate (MAC) unit. The MAC operation is computed by right shift-and-add, which allows runtime truncation without additional hardware. The architecture uses hardware resources efficiently, improving throughput while reducing computational complexity through bit truncation. The underlying MAC algorithm supports both iterative and pipelined implementations, so an accelerator can favor either hardware efficiency or throughput. We also introduce a processing element (PE) with a bias pre-loading scheme that saves one clock cycle of delay and eliminates the extra resources conventionally required in PE implementations. The PE performs quantization-based MAC calculations through an efficient bit-truncation method that needs no extra hardware logic, and it supports variable bit precision with a dynamic fraction part in the sfxpt<N,f> representation to meet model- or layer-specific demands. In software emulation, the proposed approach incurs under 1.6% accuracy loss for LeNet-5 on MNIST and around 4% for ResNet-18 and VGG-16 on CIFAR-10 in the sfxpt<8,5> format, compared with conventional float32-based implementations. Hardware results on the Xilinx Virtex-7 board show a 37% reduction in area utilization and a 45% reduction in power consumption compared with the best state-of-the-art MAC architecture. Extending the proposed MAC to a LeNet DNN model yields a 42% reduction in resource requirements and a 27% reduction in delay. The architecture is therefore well suited to resource-efficient, high-throughput edge-AI applications.
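To make the right shift-and-add idea concrete, the following Python sketch emulates a truncating fixed-point MAC in software. It is an illustration under stated assumptions, not the authors' hardware design: it reads sfxpt<N,f> as a signed fixed-point code with N total bits and f fraction bits, handles signs in sign-magnitude form, and pre-aligns the multiplicand so that the bits shifted out of the accumulator are exactly the truncated low-order product bits. The function names (`quantize`, `shift_add_mul`, `mac`) are hypothetical.

```python
def quantize(x, n_bits=8, frac_bits=5):
    """Map a real value to an integer code, assuming sfxpt<n_bits, frac_bits>
    means signed fixed point with frac_bits fraction bits (an assumption)."""
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    q = round(x * (1 << frac_bits))
    return max(lo, min(hi, q))  # saturate to the representable range


def shift_add_mul(a, b, n_bits=8, frac_bits=5):
    """Multiply two codes by right shift-and-add; the result keeps frac_bits
    fraction bits. Signs are handled in sign-magnitude form (an assumption)."""
    negative = (a < 0) != (b < 0)
    mag_a, mag_b = abs(a), abs(b)
    # Pre-align the multiplicand so n_bits right shifts leave frac_bits
    # fraction bits in the result.
    addend = mag_a << (n_bits - frac_bits)
    acc = 0
    for i in range(n_bits):      # scan multiplier bits, LSB first
        if (mag_b >> i) & 1:
            acc += addend
        acc >>= 1                # the bit shifted out here is the truncation
    return -acc if negative else acc


def mac(weights, inputs, bias, n_bits=8, frac_bits=5):
    """Accumulate products, starting from the bias instead of zero
    (a software analogue of pre-loading the bias)."""
    acc = bias
    for w, x in zip(weights, inputs):
        acc += shift_add_mul(w, x, n_bits, frac_bits)
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    return max(lo, min(hi, acc))  # saturate the final result to n_bits


# Example in the assumed sfxpt<8,5> format: 0.5*1.0 + (-0.25)*2.0 + 0.125
w = [quantize(0.5), quantize(-0.25)]
x = [quantize(1.0), quantize(2.0)]
y = mac(w, x, quantize(0.125))
print(y, y / (1 << 5))
```

Because every addend is an integer, dropping one bit per iteration gives the same magnitude as computing the full product and discarding its lowest `frac_bits` bits in a single step. In this sketch, truncation therefore needs no separate rounding or truncation stage.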

QuantMAC: Enhancing Hardware Performance in DNNs With Quantize Enabled Multiply-Accumulate Unit
