Abstract: Low-bit quantization struggles to maintain high performance with limited model capacity (e.g., 4-bit for both weights and activations). Naturally, the distributions of both weights and activations in deep neural networks are Gaussian-like. Nevertheless, due to the limited bitwidth of low-bit models, uniform-like distributed weights and activations have been shown to be more friendly to quantization while preserving accuracy~\cite{Han2015Learning}. Motivated by this, we propose Scale-Clip, a Distribution Reshaping technique that can reshape weights or activations into a uniform-like distribution in a dynamic manner. Furthermore, to increase the model capability of a low-bit model, a novel Group-based Quantization algorithm is proposed to split the filters into several groups. Different groups can learn different quantization parameters, which can be elegantly merged into the batch normalization layer without extra computational cost in the inference stage. Finally, we integrate the Scale-Clip technique with the Group-based Quantization algorithm and propose the Group-based Distribution Reshaping Quantization (GDRQ) framework to further improve the quantization performance. Experiments on various networks (e.g., VGGNet and ResNet) and vision tasks (e.g., classification, detection, and segmentation) demonstrate that our framework achieves good performance.

What problem does this paper attempt to address?

This paper addresses how to maintain the high performance of deep neural network models under low-bit quantization (such as 4-bit weights and activations). Specifically, the authors point out that, because the bit width of a low-bit model is limited, uniformly distributed weights and activations are easier to quantize without losing accuracy than Gaussian-distributed ones. In practice, however, the weights and activations of deep neural networks mostly follow Gaussian or Laplacian distributions, which leads to a large quantization loss. The authors therefore propose two main strategies:

1. **Distribution Reshaping (DR)**: A method called Scale-Clip dynamically reshapes the distribution of weights and activations so that it is close to uniform, which reduces the quantization loss and improves the performance of low-bit models.
2. **Group-based Quantization (GQ)**: The convolutional filters are divided into multiple groups, and each group can learn its own quantization parameters. This increases the expressive ability of low-bit models without adding computational cost at inference time.

Combining these two methods, the authors propose the **Group-based Distribution Reshaping Quantization (GDRQ)** framework, which aims to further improve the performance of low-bit quantization. The experimental results show that this framework performs better than existing methods on multiple networks (such as VGGNet and ResNet) and vision tasks (such as classification, detection, and segmentation). In particular, the accuracy of a ResNet-50 model with 2-bit weights and 4-bit activations drops by less than 1% on the ImageNet classification task, which was the best-known result at that time.
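The two strategies can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the summary above does not state the clipping rule, so the threshold `T = k * mean(|x|)`, the default `k`, the symmetric uniform quantizer, and the function names `scale_clip`, `quantize_uniform`, and `group_quantize` are all assumptions made for illustration.

```python
import numpy as np


def scale_clip(x, k=2.0):
    """Clip x to [-T, T]. ASSUMPTION: T = k * mean(|x|); the source gives no formula."""
    t = k * np.mean(np.abs(x))
    return np.clip(x, -t, t), t


def quantize_uniform(x, t, bits=4):
    """Symmetric uniform quantization of values already clipped to [-t, t]."""
    levels = 2 ** (bits - 1) - 1          # e.g. 7 positive levels for 4 bits
    scale = max(t / levels, 1e-12)        # guard against an all-zero input
    q = np.round(x / scale)               # integer codes in [-levels, levels]
    return q * scale, scale


def group_quantize(weights, num_groups=4, bits=4, k=2.0):
    """Quantize filters group by group along the output-channel axis (axis 0).

    Each group gets its own clipping threshold and step size, so groups with
    different value ranges do not share one coarse quantization grid.
    """
    out = np.empty_like(weights)
    scales = []
    for idx in np.array_split(np.arange(weights.shape[0]), num_groups):
        clipped, t = scale_clip(weights[idx], k)
        out[idx], scale = quantize_uniform(clipped, t, bits)
        scales.append(scale)
    return out, np.array(scales)


# Hypothetical usage: 8 filters of shape 3x3x3, split into 4 groups.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 3, 3, 3))
wq, scales = group_quantize(w, num_groups=4, bits=4)
```

Clipping at a multiple of the mean absolute value discards the long Gaussian tails, so the remaining values spread more evenly over the few available levels. Because each group's step size is a single constant per output channel, it is the kind of per-channel factor that a following batch normalization layer could absorb, which is the merging the abstract describes; the sketch does not implement that folding.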

GDRQ: Group-based Distribution Reshaping for Quantization

Low-bit Quantization Needs Good Distribution.

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

Outlier-Aware Training for Low-Bit Quantization of Structural Re-Parameterized Networks

Distribution Matched Low-bit Post-Training Quantization for Convolutional Neural Networks

Finding the Task-Optimal Low-Bit Sub-Distribution in Deep Neural Networks.

Deep quantization generative networks

DPQ: dynamic pseudo-mean mixed-precision quantization for pruned neural network

Quantization Networks

Instance-Aware Dynamic Neural Network Quantization

Blended coarse gradient descent for full quantization of deep neural networks

Post-training quantization for re-parameterization via coarse & fine weight splitting

Distribution-aware Adaptive Multi-bit Quantization

Optimal Quantization for Batch Normalization in Neural Network Deployments and Beyond

EasyQuant: Post-training Quantization via Scale Optimization

GWQ: Gradient-Aware Weight Quantization for Large Language Models

GroupQ: Group-Wise Quantization With Multi-Objective Optimization for CNN Accelerators

Deep Learning with Low Precision by Half-Wave Gaussian Quantization

Fine-grained Data Distribution Alignment for Post-Training Quantization