Abstract: The study of deep neural networks has gained widespread attention in recent years, with many researchers proposing network structures that exhibit exceptional performance. A current trend in artificial intelligence (AI) is to apply deep learning through large-scale pretrained deep neural network models. This approach aims to improve the generalization capability and task-specific performance of the model, particularly in areas such as computer vision and natural language processing. Despite their success, deploying high-performance deep neural network models on edge hardware platforms, such as household appliances and smartphones, remains challenging owing to the high complexity of the network architecture, the substantial storage overhead, and the computational cost. These factors hinder the availability of AI technologies to the public. Therefore, compressing and accelerating deep neural network models has become a critical issue in promoting their large-scale commercial application. Owing to the growing support for low-precision computation provided by AI hardware manufacturers, model quantization has emerged as a promising approach for compressing and accelerating machine learning models. By reducing the bit width of the model parameters and of the intermediate feature maps produced during forward propagation, memory usage, computational cost, and energy consumption can be substantially reduced, enabling quantized deep neural network models to run on resource-limited edge devices. However, this approach involves a critical tradeoff between task performance and hardware deployment, which directly affects its potential for practical application. Quantizing a model to low-bit precision can cause considerable information loss, often resulting in a catastrophic degradation of task performance.
Thus, alleviating the challenges of model quantization while maintaining task performance has become a critical research topic in AI. Furthermore, because of differences in hardware devices, constraints of application scenarios, and data accessibility, model quantization has become a multibranch problem that includes data-dependent, data-free, mixed-precision, and extremely low-bit quantization, among others. By comprehensively investigating the quantization methods for deep neural networks that have been proposed from different perspectives, and by thoroughly summarizing their advantages and disadvantages, the essential problems associated with the quantization of deep neural networks can be explored and directions for possible future development can be identified.
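
The tradeoff between bit width and information loss described above can be illustrated with a minimal sketch. It assumes symmetric uniform quantization of a synthetic Gaussian weight vector, which is only one of many schemes and is not taken from any specific method covered here. The names `quantize_uniform` and `mean_squared_error` are hypothetical helpers introduced for illustration.

```python
import random


def quantize_uniform(values, num_bits):
    """Quantize to signed num_bits integer codes with one shared scale,
    then map the codes back to floats (quantize followed by dequantize)."""
    if num_bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    qmax = 2 ** (num_bits - 1) - 1  # largest code, e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)  # all zeros: nothing to quantize
    scale = max_abs / qmax  # step size between adjacent levels
    # Round to the nearest code and clamp to the symmetric range [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [c * scale for c in codes]


def mean_squared_error(a, b):
    """Average squared difference between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic stand-in for a layer's weights (an assumption, not real model data).
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

errors = {
    bits: mean_squared_error(weights, quantize_uniform(weights, bits))
    for bits in (8, 4, 2)
}
for bits, err in errors.items():
    print(f"{bits}-bit reconstruction MSE: {err:.6f}")
```

In this sketch the reconstruction error grows as the bit width shrinks, because fewer levels must cover the same value range. At 2 bits only three levels remain, which hints at why extremely low-bit quantization usually needs more than plain rounding.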

μL2Q: An Ultra-Low Loss Quantization Method for DNN Compression

VecQ: Minimal Loss DNN Model Compression With Vectorized Weight Quantization

Deep Neural Network Compression With Single and Multiple Level Quantization

LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks

Quantization Networks

QD-Compressor: A Quantization-based Delta Compression Framework for Deep Neural Networks

Towards Low-Bit Quantization of Deep Neural Networks with Limited Data

A Novel Low-Bit Quantization Strategy for Compressing Deep Neural Networks

MedQ: Lossless ultra-low-bit neural network quantization for medical image segmentation

Toward Extremely Low Bit and Lossless Accuracy in DNNs with Progressive ADMM

CNQ: Compressor‐Based Non‐uniform Quantization of Deep Neural Networks

ECQ$^{\text{x}}$: Explainability-Driven Quantization for Low-Bit and Sparse DNNs

AutoQNN: An End-to-End Framework for Automatically Quantizing Neural Networks

Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights

Bit-Quantized-Net: An Effective Method for Compressing Deep Neural Networks

Weight Normalization based Quantization for Deep Neural Network Compression

A Survey of Quantization Methods for Deep Neural Networks

SQuant: On-the-Fly Data-Free Quantization Via Diagonal Hessian Approximation

Learning Accurate Low-bit Quantization towards Efficient Computational Imaging

SearchQ: Search-based Fine-Grained Quantization for Data-Free Model Compression