Abstract: Traditional digital implementations of neural accelerators are limited by high power and area overheads, while analog and non-CMOS implementations suffer from noise, device mismatch, and reliability issues. This paper introduces a CMOS Look-Up Table (LUT)-based Neural Accelerator (LUT-NA) framework that reduces the power, latency, and area consumption of traditional digital accelerators through pre-computed, faster look-ups while avoiding the noise and mismatch of analog circuits. To solve the scalability issues of conventional LUT-based computation, we split the high-precision multiply and accumulate (MAC) operations into lower-precision MACs using a divide-and-conquer-based approach. We show that LUT-NA achieves up to $29.54\times$ lower area with $3.34\times$ lower energy per inference task than traditional LUT-based techniques and up to $1.23\times$ lower area with $1.80\times$ lower energy per inference task than conventional digital MAC-based techniques (Wallace Tree/Array Multipliers) without retraining and without affecting accuracy, even on lottery ticket pruned (LTP) models that already reduce the number of required MAC operations by up to 98%. Finally, we introduce mixed precision analysis in the LUT-NA framework for various LTP models (VGG11, VGG19, Resnet18, Resnet34, GoogleNet) that achieved up to $32.22\times$-$50.95\times$ lower area across models with $3.68\times$-$6.25\times$ lower energy per inference than traditional LUT-based techniques, and up to $1.35\times$-$2.14\times$ lower area requirement with $1.99\times$-$3.38\times$ lower energy per inference across models as compared to conventional digital MAC-based techniques with $\sim$1% accuracy loss.

What problem does this paper attempt to address?

The main problems this paper addresses are the high power consumption, large chip area, and low energy efficiency of traditional neural network accelerators when processing deep neural networks (DNNs). Specifically:

1. **Limitations of traditional digital implementations**:
   - Digitally implemented neural accelerators struggle to handle complex DNN workloads efficiently because of their high power consumption and large chip-area overhead.
   - Analog and non-CMOS implementations can improve energy efficiency, but they are vulnerable to noise, device mismatch, and reliability issues.
2. **Scalability and energy-efficiency challenges**:
   - As neural network models grow more complex, faster processing and efficient memory usage become more important.
   - Efficient neural network computing matters most on resource-constrained nodes such as biomedical implants and wearable devices.

To solve these problems, the authors propose a look-up table (LUT)-based neural accelerator framework (LUT-NA). It aims to reduce power consumption, latency, and chip area through fast look-ups of pre-computed results while avoiding the noise and mismatch problems of analog circuits.

### Main contributions

1. **A programmable and scalable LUT-NA framework**:
   - A novel divide-and-conquer (D&C) method is proposed to implement LUT-NA, making the LUT architecture scalable across multiple DNN models and bit resolutions.
2. **Mixed-precision analysis and approximate computing**:
   - Mixed-precision analysis and approximate computing are introduced, further reducing energy and area consumption while sacrificing only about 1% of accuracy.
3. **Lottery ticket pruning (LTP) combined with LUT-NA**:
   - Applying LUT-NA to models whose number of MAC operations has already been significantly reduced by LTP further improves its scalability.
4. **Hardware efficiency analysis**:
   - The hardware efficiency (energy and area consumption per inference) of LUT-NA and of approximate/mixed-precision LUT-NA is analyzed for different deep-learning models.

Through these methods, the LUT-NA framework significantly improves the energy efficiency and scalability of neural network accelerators, applies to multiple deep-learning models, and performs well in resource-constrained environments.
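The divide-and-conquer idea summarized above can be illustrated in software. The following is a minimal sketch, not the paper's hardware design: it assumes unsigned 8-bit operands split into two 4-bit chunks, and all names (`build_lut`, `dc_multiply`, `dc_mac`, `lut_entries`, `CHUNK_BITS`) are hypothetical. The paper's actual partitioning, signed-number handling, and circuit organization may differ.

```python
# Illustrative sketch only. Function names and the 4-bit chunk width are
# assumptions made for this example, not details taken from the paper.

CHUNK_BITS = 4  # assumed width of each low-precision chunk


def build_lut(bits):
    """Pre-compute every product of two unsigned `bits`-wide operands."""
    size = 1 << bits
    return [[a * b for b in range(size)] for a in range(size)]


LUT4 = build_lut(CHUNK_BITS)


def dc_multiply(a, b, lut=LUT4, bits=CHUNK_BITS):
    """Multiply two unsigned (2 * bits)-wide integers using only
    table look-ups, shifts, and additions."""
    mask = (1 << bits) - 1
    a_hi, a_lo = a >> bits, a & mask
    b_hi, b_lo = b >> bits, b & mask
    # (a_hi * 2^bits + a_lo) * (b_hi * 2^bits + b_lo), expanded into
    # four low-precision partial products that are each looked up.
    return (
        (lut[a_hi][b_hi] << (2 * bits))
        + ((lut[a_hi][b_lo] + lut[a_lo][b_hi]) << bits)
        + lut[a_lo][b_lo]
    )


def dc_mac(weights, activations):
    """Multiply-and-accumulate built on the look-up based multiplier."""
    acc = 0
    for w, x in zip(weights, activations):
        acc += dc_multiply(w, x)
    return acc


def lut_entries(bits):
    """Number of entries in a table indexed by two `bits`-wide operands."""
    return 1 << (2 * bits)
```

Under these assumptions, a single table covering all 8-bit by 8-bit products needs `lut_entries(8)` = 65,536 entries, while the 4-bit table has `lut_entries(4)` = 256 entries and is consulted four times per product. This entry-count arithmetic only illustrates why splitting operands keeps table size manageable; it is not a reproduction of the area or energy figures reported in the abstract.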

Look-Up Table based Neural Network Hardware

LUTMUL: Exceed Conventional FPGA Roofline Limit by LUT-based Efficient Multiplication for Neural Network Inference

LUTNet: Learning FPGA Configurations for Highly Efficient Neural Network Inference

Table-Lookup MAC: Scalable Processing of Quantised Neural Networks in FPGA Soft Logic

LUT-NN: Empower Efficient Neural Network Inference with Centroid Learning and Table Lookup

QuantMAC: Enhancing Hardware Performance in DNNs With Quantize Enabled Multiply-Accumulate Unit

Logic Design of Neural Networks for High-Throughput and Low-Power Applications

Neural Synaptic Plasticity-Inspired Computing: A High Computing Efficient Deep Convolutional Neural Network Accelerator

NeuraLUT: Hiding Neural Network Density in Boolean Synthesizable Functions

Low-Complexity Precision-Scalable Multiply-Accumulate Unit Architectures for Deep Neural Network Accelerators

EncodingNet: A Novel Encoding-based MAC Design for Efficient Neural Network Acceleration

LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration

T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

Dynamic Power Control in a Hardware Neural Network with Error-Configurable MAC Units

PolyLUT-Add: FPGA-based LUT Inference with Wide Inputs

NASA: Neural Architecture Search and Acceleration for Hardware Inspired Hybrid Networks

A Low-Power Sparse Convolutional Neural Network Accelerator with Pre-Encoding Radix-4 Booth Multiplier

A Low-Power Accelerator for Deep Neural Networks with Enlarged Near-Zero Sparsity

PIXEL: Photonic Neural Network Accelerator

Bit-pragmatic Deep Neural Network Computing

Efficient Hardware Optimization Strategies For Deep Neural Networks Acceleration Chip