Abstract: This paper presents incremental network quantization (INQ), a novel method that efficiently converts any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version whose weights are constrained to be either powers of two or zero. Unlike existing methods, which suffer from noticeable accuracy loss, INQ has the potential to resolve this issue thanks to two innovations. First, we introduce three interdependent operations: weight partition, group-wise quantization and re-training. A well-proven measure is used to divide the weights in each layer of a pre-trained CNN model into two disjoint groups. The weights in the first group form a low-precision base, so they are quantized by a variable-length encoding method. The weights in the other group compensate for the accuracy loss caused by the quantization, so they are the ones to be re-trained. Second, these three operations are repeated on the latest re-trained group in an iterative manner until all the weights are converted into low-precision ones, acting as an incremental network quantization and accuracy enhancement procedure. Extensive experiments on the ImageNet classification task using almost all known deep CNN architectures, including AlexNet, VGG-16, GoogleNet and ResNets, demonstrate the efficacy of the proposed method. Specifically, at 5-bit quantization, our models achieve higher accuracy than the 32-bit floating-point references. Taking ResNet-18 as an example, we further show that our quantized models with 4-bit, 3-bit and 2-bit ternary weights have improved or very similar accuracy compared with the 32-bit floating-point baseline. In addition, impressive results from combining network pruning with INQ are also reported. The code is available at https://github.com/Zhouaojun/Incremental-Network-Quantization.
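
The partition, quantize and re-train loop described in the abstract can be illustrated with a small sketch. This is not the authors' implementation, and several details are assumptions made only for illustration. The abstract says only that "a well-proven measure" splits the weights, so the sketch assumes a magnitude-based split in which the largest weights are quantized first. It also assumes nearest-level rounding onto the set {0, ±2^n}, and it stands in for re-training with an optional callback. The names `quantize_pow2`, `inq`, `schedule`, `n_min`, `n_max` and `retrain` are hypothetical.

```python
import math


def quantize_pow2(w, n_min, n_max):
    """Map w to the nearest value in {0, +/-2**n : n_min <= n <= n_max}.

    Nearest-level rounding is an assumption of this sketch.
    """
    levels = [0.0] + [2.0 ** n for n in range(n_min, n_max + 1)]
    nearest = min(levels, key=lambda level: abs(abs(w) - level))
    if nearest == 0.0:
        return 0.0
    return math.copysign(nearest, w)


def inq(weights, schedule=(0.5, 0.75, 1.0), n_min=-3, n_max=0, retrain=None):
    """Incrementally quantize a flat list of weights.

    schedule: cumulative fractions of weights that are quantized after
              each iteration (hypothetical values).
    retrain:  optional callback retrain(weights, frozen) that may update
              only the entries whose frozen flag is False.
    """
    weights = list(weights)
    frozen = [False] * len(weights)
    for fraction in schedule:
        # Weight partition (assumed here: largest magnitudes go first).
        target = math.ceil(fraction * len(weights))
        free = [i for i in range(len(weights)) if not frozen[i]]
        free.sort(key=lambda i: abs(weights[i]), reverse=True)
        need = max(target - sum(frozen), 0)
        # Group-wise quantization: freeze the selected group.
        for i in free[:need]:
            weights[i] = quantize_pow2(weights[i], n_min, n_max)
            frozen[i] = True
        # Re-training: the remaining full-precision weights compensate.
        if retrain is not None:
            retrain(weights, frozen)
    return weights


if __name__ == "__main__":
    print(inq([0.9, -0.3, 0.05, 0.6]))  # [1.0, -0.25, 0.0, 0.5]
```

Without a `retrain` callback the schedule has no effect on the final values. In a real training setup, the callback would run gradient updates on the unfrozen weights between quantization steps, and that is where the incremental schedule matters.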

Pyramid Vector Quantization and Bit Level Sparsity in Weights for Efficient Neural Networks Inference

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

Pyramid Vector Quantization for LLMs

SeerNet: Predicting Convolutional Neural Network Feature-Map Sparsity Through Low-Bit Quantization.

Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights

Focused Quantization for Sparse CNNs

Vector Quantization for Machine Vision

SQuantizer: Simultaneous Learning for Both Sparse and Low-precision Neural Networks

Learning Low Resource Consumption CNN through Pruning and Quantization

Compressing Deep Convolutional Networks using Vector Quantization

Deep Neural Network Compression With Single and Multiple Level Quantization

Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators

Transform Quantization for CNN (Convolutional Neural Network) Compression

LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks

Post-Training Non-Uniform Quantization for Convolutional Neural Networks

DQI: A Dynamic Quantization Method for Efficient Convolutional Neural Network Inference Accelerators

Space Efficient Quantization for Deep Convolutional Neural Networks

Effective Interplay between Sparsity and Quantization: From Theory to Practice

Where Should We Begin? A Low-Level Exploration of Weight Initialization Impact on Quantized Behaviour of Deep Neural Networks

Structured Compression by Weight Encryption for Unstructured Pruning and Quantization

QIANets: Quantum-Integrated Adaptive Networks for Reduced Latency and Improved Inference Times in CNN Models