Abstract:Domain-specific hardware is becoming a promising topic in the backdrop of improvement slow down for general-purpose processors due to the foreseeable end of Moore's Law. Machine learning, especially deep neural networks (DNNs), has become the most dazzling domain witnessing successful applications in a wide spectrum of artificial intelligence (AI) tasks. The incomparable accuracy of DNNs is achieved by paying the cost of hungry memory consumption and high computational complexity, which greatly impedes their deployment in embedded systems. Therefore, the DNN compression concept was naturally proposed and widely used for memory saving and compute acceleration. In the past few years, a tremendous number of compression techniques have sprung up to pursue a satisfactory tradeoff between processing efficiency and application accuracy. Recently, this wave has spread to the design of neural network accelerators for gaining extremely high performance. However, the amount of related works is incredibly huge and the reported approaches are quite divergent. This research chaos motivates us to provide a comprehensive survey on the recent advances toward the goal of efficient compression and execution of DNNs without significantly compromising accuracy, involving both the high-level algorithms and their applications in hardware design. In this article, we review the mainstream compression approaches such as compact model, tensor decomposition, data quantization, and network sparsification. We explain their compression principles, evaluation metrics, sensitivity analysis, and joint-way use. Then, we answer the question of how to leverage these methods in the design of neural network accelerators and present the state-of-the-art hardware architectures. In the end, we discuss several existing issues such as fair comparison, testing workloads, automatic compression, influence on security, and framework/hardware-level support, and give promising topics in this field and the possible challenges as well. This article attempts to enable readers to quickly build up a big picture of neural network compression and acceleration, clearly evaluate various methods, and confidently get started in the right way.

Design of High Performance RNN Accelerator Based on Network Compression

High-performance Reconfigurable DNN Accelerator on a Bandwidth-limited Embedded System

A 3.89-Gops/mw Scalable Recurrent Neural Network Processor with Improved Efficiency on Memory and Computation

Towards ReRAM-based Accelerator:An Energy-efficient NN Model Compression Framework

RRAM-DNN: an RRAM and Model-Compression Empowered All-Weights-On-Chip DNN Accelerator

An Energy-Efficient Spiking Neural Network Accelerator Based on Spatio-Temporal Redundancy Reduction

Design of a Convolutional Neural Network Accelerator Based on On-Chip Data Reordering

A fine-grained mixed precision DNN accelerator using a two-stage big-little core RISC-V MCU.

Exploring Resource-Aware Deep Neural Network Accelerator and Architecture Design

A Fine-Grained Sparse Accelerator for Multi-Precision DNN.

Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey

A Convolutional Spiking Neural Network Accelerator with the Sparsity-Aware Memory and Compressed Weights

An Efficient Spiking Neural Network Accelerator with Sparse Weight.

Block-Circulant Neural Network Accelerator Featuring Fine-Grained Frequency-Domain Quantization and Reconfigurable FFT Modules

ReCom: an Efficient Resistive Accelerator for Compressed Deep Neural Networks.

Balancing memory-accessing and computing over sparse DNN accelerator via efficient data packaging

A Low-Power Sparse Convolutional Neural Network Accelerator with Pre-Encoding Radix-4 Booth Multiplier

A Low-Latency DNN Accelerator Enabled by DFT-Based Convolution Execution Within Crossbar Arrays

SparseNN: A Performance-Efficient Accelerator for Large-Scale Sparse Neural Networks

ReRAM-Sharing: Fine-Grained Weight Sharing for ReRAM-Based Deep Neural Network Accelerator.

High Area/Energy Efficiency RRAM CNN Accelerator with Pattern-Pruning-Based Weight Mapping Scheme