Abstract: Network quantization, which aims to reduce the bit-lengths of network weights and activations, has emerged as a way to deploy networks on resource-limited devices. Although recent studies have successfully discretized a full-precision network, they still incur large quantization errors after training, giving rise to a significant performance gap between a full-precision network and its quantized counterpart. In this work, we propose a novel quantization method for neural networks, Cluster-Promoting Quantization (CPQ), that finds the optimal quantization grids while naturally encouraging the underlying full-precision weights to gather around those quantization grids cohesively during training. This property of CPQ arises from two main ingredients that enable differentiable quantization: i) the use of the categorical distribution designed by a specific probabilistic parametrization in the forward pass and ii) our proposed multi-class straight-through estimator (STE) in the backward pass. Since our second component, multi-class STE, is intrinsically biased, we additionally propose a new bit-drop technique, DropBits, that revises the standard dropout regularization to randomly drop bits instead of neurons. As a natural extension of DropBits, we further introduce a way of learning heterogeneous quantization levels to find the proper bit-length for each layer by imposing an additional regularization on DropBits. We experimentally validate our method on various benchmark datasets and network architectures, and also support a new hypothesis for quantization: learning heterogeneous quantization levels outperforms the case using the same but fixed quantization levels from scratch.

What problem does this paper attempt to address?

The paper aims to reduce quantization loss when quantizing neural networks while maintaining performance at low bit-widths. Specifically, it proposes a new quantization method named Cluster-Promoting Quantization (CPQ), which finds the optimal quantization grid and naturally encourages full-precision weights to cluster around the grid points during training, thereby reducing quantization errors. To further improve performance, the paper also proposes a new bit-dropping technique called DropBits, which reduces the bias of the multi-class Straight-Through Estimator (multi-class STE) by randomly dropping bits instead of neurons. Finally, the paper explores learning heterogeneous quantization levels so that each layer can use its own suitable bit-width, thereby achieving more efficient resource utilization.

### Main contributions of the paper:

1. **Cluster-Promoting Quantization (CPQ)**:
   - A new quantization method is proposed, which not only finds the optimal quantization grid but also encourages full-precision weights to cluster around the grid points at low bit-widths.
   - Combining a specific probabilistic parameterization with the multi-class Straight-Through Estimator (multi-class STE) yields better clustering and better final performance.
2. **DropBits**:
   - A new bit-dropping technique is proposed, which reduces the bias of the multi-class Straight-Through Estimator by randomly dropping bits.
   - It is inspired by Dropout, but drops bits in the quantization process instead of neurons.
3. **Heterogeneous quantization**:
   - An additional regularization is introduced, which allows learning different bit-widths for each layer or channel, thereby achieving more efficient resource utilization.
   - It is verified that learning heterogeneous quantization levels results in better network performance than training a network from scratch with the same fixed quantization level.

### Experimental results:

- The paper reports extensive experiments on multiple benchmark datasets, including MNIST, CIFAR-10 and ImageNet, verifying the effectiveness of CPQ and DropBits.
- On ResNet-18 and MobileNetV2, CPQ + DropBits achieved new state-of-the-art results when all layers were uniformly quantized.
- The heterogeneous quantization method still achieved satisfactory results when using at most 4 bits for all layers, supporting the new quantization hypothesis.

### Formula presentation:

- **Quantization probability calculation**:
  \[ \pi_i=\text{Sigmoid}\left(\frac{g_i+\frac{\alpha}{2}-x}{\sigma}\right)-\text{Sigmoid}\left(\frac{g_i-\frac{\alpha}{2}-x}{\sigma}\right) \]
  where \( x \) is the full-precision value being quantized, \( g_i \) is the \( i \)-th quantization grid point, \( \alpha \) controls the grid spacing, and \( \sigma \) is the standard deviation.
- **Multi-class Straight-Through Estimator**:
  - Forward propagation:
    \[ y = \text{onehot}(\arg\max_i\pi_i) \]
  - Backward propagation:
    \[ \frac{\partial L}{\partial\pi_{i_{\text{max}}}}=\frac{\partial L}{\partial y_{i_{\text{max}}}}, \quad \frac{\partial L}{\partial\pi_i} = 0\quad\forall i\neq i_{\text{max}} \]
- **DropBits' binary mask generation**:
  \[ U_k\sim\text{Uniform}(0, 1) \]
  \[ S_k=\text{Sigmoid}\left(\frac{\lo
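
The quantization probability and the multi-class STE formulas above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the 2-bit grid, and the values of `alpha`, `sigma`, and `x` are hypothetical, and reading off the quantized value as the dot product of the one-hot vector with the grid is an assumption made for illustration. The backward pass is written out by hand to mirror the gradient rule as stated above.

```python
import numpy as np

def sigmoid(z):
    # Logistic function written via tanh for numerical stability.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def grid_probabilities(x, grid, alpha, sigma):
    # pi_i = Sigmoid((g_i + alpha/2 - x)/sigma) - Sigmoid((g_i - alpha/2 - x)/sigma)
    upper = sigmoid((grid + alpha / 2.0 - x) / sigma)
    lower = sigmoid((grid - alpha / 2.0 - x) / sigma)
    return upper - lower

def multiclass_ste_forward(pi):
    # Forward pass: y = onehot(argmax_i pi_i)
    y = np.zeros_like(pi)
    y[np.argmax(pi)] = 1.0
    return y

def multiclass_ste_backward(pi, grad_y):
    # Backward pass: pass dL/dy through at the argmax index, zero elsewhere.
    grad_pi = np.zeros_like(pi)
    i_max = np.argmax(pi)
    grad_pi[i_max] = grad_y[i_max]
    return grad_pi

# Hypothetical 2-bit example: four grid points spaced alpha apart.
alpha, sigma = 1.0, 0.25
grid = alpha * np.arange(4)      # grid points 0, 1, 2, 3
x = 1.2                          # a full-precision weight (illustrative value)

pi = grid_probabilities(x, grid, alpha, sigma)
y = multiclass_ste_forward(pi)
x_quantized = float(y @ grid)    # assumed read-out: the selected grid point

# For the toy loss L = y @ grid, dL/dy equals the grid itself.
grad_pi = multiclass_ste_backward(pi, grad_y=grid)
```

In this toy setting the probability mass concentrates on the grid point nearest to `x`, so the forward pass selects that point, and only the corresponding entry of `grad_pi` is non-zero.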

Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

Bit-shrinking: Limiting Instantaneous Sharpness for Improving Post-training Quantization

AE-Qdrop: Towards Accurate and Efficient Low-Bit Post-Training Quantization for A Convolutional Neural Network

Towards the Limit of Network Quantization

Quantization Networks

CSMPQ: Class Separability Based Mixed-Precision Quantization

Residual Quantization for Low Bit-Width Neural Networks

PD-Quant: Post-Training Quantization Based on Prediction Difference Metric

Mixed-Precision Quantized Neural Network with Progressively Decreasing Bitwidth For Image Classification and Object Detection

Mixed-precision Deep Neural Network Quantization With Multiple Compression Rates

Towards Low-Bit Quantization of Deep Neural Networks with Limited Data

LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks

Low-bit Quantization Needs Good Distribution

DPQ: dynamic pseudo-mean mixed-precision quantization for pruned neural network

Two-Step Quantization for Low-bit Neural Networks

Bit Efficient Quantization for Deep Neural Networks

Learning to Quantize Deep Networks by Optimizing Quantization Intervals with Task Loss

Deep Neural Network Compression With Single and Multiple Level Quantization

Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks