Stella Nera: Achieving 161 TOp/s/W with Multiplier-free DNN Acceleration based on Approximate Matrix Multiplication

Jannis Schönleber, Lukas Cavigelli, Renzo Andri, Matteo Perotti, Luca Benini

2023-11-17

Abstract: From classical HPC to deep learning, MatMul is at the heart of today's computing. The recent Maddness method approximates MatMul without the need for multiplication by using a hash-based version of product quantization (PQ) indexing into a look-up table (LUT). Stella Nera is the first Maddness accelerator and it achieves 15x higher area efficiency (GMAC/s/mm^2) and more than 25x higher energy efficiency (TMAC/s/W) than direct MatMul accelerators implemented in the same technology. The hash function is a decision tree, which allows for an efficient hardware implementation as the multiply-accumulate operations are replaced by decision tree passes and LUT lookups. The entire Maddness MatMul can be broken down into parts that allow an effective implementation with small computing units and memories, allowing it to reach extreme efficiency while remaining generically applicable for MatMul tasks. In a commercial 14nm technology and scaled to 3nm, we achieve an energy efficiency of 161 TOp/s/W@0.55V with a Top-1 accuracy on CIFAR-10 of more than 92.5% using ResNet9.

Hardware Architecture, Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

This paper aims to improve the efficiency of matrix multiplication (MatMul) in deep-learning workloads, in particular the energy efficiency and area efficiency of hardware accelerators. As deep-learning models keep growing, the demand for compute throughput and efficiency grows with them, yet conventional multiplier-based MatMul accelerators are running into energy-efficiency bottlenecks. The paper builds on the recent Maddness method, which approximates MatMul with a hash function and look-up tables (LUTs) instead of multiplications, and shows that this approach yields higher energy and area efficiency in hardware. The main contributions are:

1. **Stella Nera**: An open-source, fully parameterized hardware accelerator for Maddness. Implemented in a commercial 14 nm technology and scaled to 3 nm, it reaches an energy efficiency of 161 tera-operations per second per watt (TOp/s/W) at 0.55 V.
2. **Differentiable Maddness**: The first differentiable formulation of Maddness, which allows the method to be used when training deep neural networks (DNNs).
3. **PyTorch implementation**: A well-tested PyTorch implementation of the differentiable Maddness linear and convolutional layers.
4. **Experimental results**: A top-1 accuracy of 92.6% on CIFAR-10 with ResNet9, only 1.2% away from the FP32 baseline.

Together, these contributions aim to provide an efficient, general-purpose MatMul approximation that suits large-scale deep-learning tasks and reaches very high energy and area efficiency in hardware.
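To make the idea of replacing multiply-accumulates with decision tree passes and LUT lookups concrete, here is a minimal NumPy sketch of a Maddness-style approximate MatMul. It is an illustration under stated assumptions, not the paper's implementation: the function names (`fit_tree`, `encode`, `fit_maddness`, `approx_matmul`), the tree depth of 4, and the split heuristic (largest within-bucket variance, median threshold) are all choices made here for brevity. Any further prototype optimization or LUT quantization the real method may use is omitted.

```python
import numpy as np

DEPTH = 4  # assumed tree depth for this sketch: 2**4 = 16 leaves per codebook


def fit_tree(X, depth=DEPTH):
    """Learn one decision tree on training rows X of shape (n, d_sub).

    Simplified heuristic (an assumption of this sketch): every level splits on
    the dimension with the largest within-bucket variance, at the bucket median.
    """
    leaf = np.zeros(X.shape[0], dtype=np.int64)
    split_dims, thresholds = [], []
    for level in range(depth):
        n_nodes = 2 ** level
        score = np.zeros(X.shape[1])
        for node in range(n_nodes):
            rows = X[leaf == node]
            if len(rows) > 1:
                score += rows.var(axis=0) * len(rows)
        dim = int(np.argmax(score))
        thr = np.zeros(n_nodes)
        for node in range(n_nodes):
            vals = X[leaf == node, dim]
            if len(vals):
                thr[node] = np.median(vals)
        split_dims.append(dim)
        thresholds.append(thr)
        leaf = 2 * leaf + (X[:, dim] > thr[leaf])
    # One prototype per leaf: the mean of the training rows that land there.
    protos = np.zeros((2 ** depth, X.shape[1]))
    for k in range(2 ** depth):
        rows = X[leaf == k]
        if len(rows):
            protos[k] = rows.mean(axis=0)
    return split_dims, thresholds, protos


def encode(X, split_dims, thresholds):
    """Hash each row to a leaf index using only comparisons (one tree pass)."""
    leaf = np.zeros(X.shape[0], dtype=np.int64)
    for dim, thr in zip(split_dims, thresholds):
        leaf = 2 * leaf + (X[:, dim] > thr[leaf])
    return leaf


def fit_maddness(A_train, B, n_codebooks):
    """Offline step: one tree per input subspace, plus a LUT of prototype @ B."""
    subspaces = np.array_split(np.arange(A_train.shape[1]), n_codebooks)
    trees, luts = [], []
    for cols in subspaces:
        split_dims, thresholds, protos = fit_tree(A_train[:, cols])
        trees.append((cols, split_dims, thresholds))
        luts.append(protos @ B[cols, :])  # the only multiplications, done offline
    return trees, np.stack(luts)  # LUT shape: (codebooks, leaves, output columns)


def approx_matmul(A, trees, luts):
    """Online step: tree pass, LUT lookup, accumulate. No multiplications."""
    out = np.zeros((A.shape[0], luts.shape[2]))
    for c, (cols, split_dims, thresholds) in enumerate(trees):
        idx = encode(A[:, cols], split_dims, thresholds)
        out += luts[c, idx]
    return out


rng = np.random.default_rng(0)
A = rng.normal(size=(2000, 16))
B = rng.normal(size=(16, 8))
trees, luts = fit_maddness(A, B, n_codebooks=4)
rel_err = np.linalg.norm(approx_matmul(A, trees, luts) - A @ B) / np.linalg.norm(A @ B)
print(f"relative Frobenius error: {rel_err:.3f}")
```

The online path in `approx_matmul` touches the input only through threshold comparisons and table reads, which is the property the abstract credits for the small computing units and memories. Unstructured Gaussian inputs, as used in this demo, are a hard case for any quantizer, so the printed error says little about accuracy on real activations.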

