Abstract: Network quantization, which reduces the bit-lengths of network weights and activations, has emerged as a way to deploy networks on resource-limited devices. Although recent studies have successfully discretized full-precision networks, they still incur large quantization errors after training, giving rise to a significant performance gap between a full-precision network and its quantized counterpart. In this work, we propose a novel quantization method for neural networks, Cluster-Promoting Quantization (CPQ), that finds the optimal quantization grids while naturally encouraging the underlying full-precision weights to gather cohesively around those grids during training. This property of CPQ stems from two main ingredients that enable differentiable quantization: i) a categorical distribution designed by a specific probabilistic parametrization in the forward pass and ii) our proposed multi-class straight-through estimator (STE) in the backward pass. Since the second component, multi-class STE, is intrinsically biased, we additionally propose a new bit-drop technique, DropBits, that revises standard dropout regularization to randomly drop bits instead of neurons. As a natural extension of DropBits, we further introduce a way of learning heterogeneous quantization levels to find the proper bit-length for each layer by imposing an additional regularization on DropBits. We experimentally validate our method on various benchmark datasets and network architectures, and also support a new hypothesis for quantization: learning heterogeneous quantization levels outperforms using the same, fixed quantization levels from scratch.
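As a rough illustration of why clustering weights around the quantization grid matters, the sketch below builds a uniform grid, assigns each weight a categorical distribution over grid points, and picks the most probable point in the forward pass. The softmax-of-negative-squared-distance parametrization, the `temperature` value, the uniform grid on [-1, 1], and all function names are assumptions made for this example; they are not the parametrization used by CPQ, and the multi-class STE backward pass and DropBits are not shown.

```python
import math


def make_grid(bits, lo=-1.0, hi=1.0):
    """Return 2**bits evenly spaced grid points on [lo, hi] (assumed uniform grid)."""
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    return [lo + i * step for i in range(levels)]


def soft_assignment(w, grid, temperature=0.05):
    """Categorical probabilities over grid points.

    Assumption: softmax of negative squared distance, chosen only for illustration.
    """
    logits = [-((w - g) ** 2) / temperature for g in grid]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def quantize(w, grid):
    """Hard forward pass: return the grid point with the highest probability."""
    probs = soft_assignment(w, grid)
    k = max(range(len(grid)), key=lambda i: probs[i])
    return grid[k]


def mean_sq_error(weights, grid):
    """Mean squared quantization error of a list of weights."""
    return sum((w - quantize(w, grid)) ** 2 for w in weights) / len(weights)


grid = make_grid(bits=2)  # four levels: -1, -1/3, 1/3, 1

# Hypothetical weights: one set spread out, one set gathered near grid points.
spread = [-0.8, -0.5, 0.0, 0.15, 0.6, 0.9]
clustered = [-0.98, -0.35, -0.30, 0.32, 0.36, 1.0]

print("spread   :", mean_sq_error(spread, grid))
print("clustered:", mean_sq_error(clustered, grid))
```

Under these assumptions, the clustered weights incur a much smaller quantization error than the spread-out ones, which is the effect the abstract attributes to encouraging weights to gather around the grid during training.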

Low-bit Quantization Needs Good Distribution

GDRQ: Group-based Distribution Reshaping for Quantization

Bit-shrinking: Limiting Instantaneous Sharpness for Improving Post-training Quantization

Finding the Task-Optimal Low-Bit Sub-Distribution in Deep Neural Networks

Distribution-aware Adaptive Multi-bit Quantization

Quantization Networks

Balanced Quantization: An Effective and Efficient Approach to Quantized Neural Networks

Optimal Quantization for Batch Normalization in Neural Network Deployments and Beyond

Cluster-Promoting Quantization with Bit-Drop for Minimizing Network Quantization Loss

Training High-Performance and Large-Scale Deep Neural Networks with Full 8-Bit Integers

One Model for All Quantization: A Quantized Network Supporting Hot-Swap Bit-Width Adjustment

Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks

Bit Efficient Quantization for Deep Neural Networks

Outlier-Aware Training for Low-Bit Quantization of Structural Re-Parameterized Networks

Instance-Aware Dynamic Neural Network Quantization

Searching for Low-Bit Weights in Quantized Neural Networks

Residual Quantization for Low Bit-Width Neural Networks

Low-precision CNN Model Quantization based on Optimal Scaling Factor Estimation

A 4-Bit Integer-Only Neural Network Quantization Method Based on Shift Batch Normalization

DPQ: dynamic pseudo-mean mixed-precision quantization for pruned neural network