Abstract: Domain-specific hardware is becoming a promising topic as performance improvement in general-purpose processors slows with the foreseeable end of Moore's Law. Machine learning, especially deep neural networks (DNNs), has become the most prominent such domain, with successful applications across a wide spectrum of artificial intelligence (AI) tasks. The high accuracy of DNNs comes at the cost of heavy memory consumption and high computational complexity, which greatly impedes their deployment in embedded systems. DNN compression was therefore proposed and is now widely used for memory saving and compute acceleration. In the past few years, a large number of compression techniques have emerged in pursuit of a satisfactory tradeoff between processing efficiency and application accuracy. Recently, this wave has spread to the design of neural network accelerators to achieve extremely high performance. However, the body of related work is very large and the reported approaches diverge widely. This fragmentation motivates us to provide a comprehensive survey of recent advances toward efficient compression and execution of DNNs without significantly compromising accuracy, covering both the high-level algorithms and their applications in hardware design. In this article, we review the mainstream compression approaches such as compact models, tensor decomposition, data quantization, and network sparsification. We explain their compression principles, evaluation metrics, sensitivity analysis, and joint use. We then answer the question of how to leverage these methods in the design of neural network accelerators and present state-of-the-art hardware architectures. Finally, we discuss several open issues, such as fair comparison, testing workloads, automatic compression, influence on security, and framework/hardware-level support, and outline promising topics in this field together with their possible challenges. This article aims to help readers quickly build a big picture of neural network compression and acceleration, clearly evaluate the various methods, and confidently get started in the right way.

Downscaling and Overflow-aware Model Compression for Efficient Vision Processors

MCMC: Multi-Constrained Model Compression Via One-Stage Envelope Reinforcement Learning

A Compression Pipeline for One-Stage Object Detection Model

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

Edge AI: Evaluation of Model Compression Techniques for Convolutional Neural Networks

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Single-path Bit Sharing for Automatic Loss-aware Model Compression

Efficient Network Compression Through Smooth-Lasso Constraint

Iterative Deep Model Compression and Acceleration in the Frequency Domain

Single-shot Pruning and Quantization for Hardware-Friendly Neural Network Acceleration

To Compress, or Not to Compress: Characterizing Deep Learning Model Compression for Embedded Inference

High Performance CNN Accelerators Based on Hardware and Algorithm Co-Optimization

Learning Low Resource Consumption CNN through Pruning and Quantization

An Efficient CNN Inference Accelerator Based on Intra- and Inter-Channel Feature Map Compression

Model Compression and Hardware Acceleration for Neural Networks: A Comprehensive Survey

Model Compression for Deep Neural Networks: A Survey

Comb, Prune, Distill: Towards Unified Pruning for Vision Model Compression

DNN Model Compression for IoT Domain-Specific Hardware Accelerators

COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models

Research on Model Compression for Embedded Platform through Quantization and Pruning

Pruning at a Glance: Global Neural Pruning for Model Compression