Abstract: Post-training quantization (PTQ) is an effective compression method for reducing model size and computational cost. However, quantizing a model to a low bit-width, e.g., below 4 bits, is difficult and often results in non-negligible performance degradation. To address this, we investigate the loss landscapes of quantized networks at various bit-widths. We show that a network with a more rugged loss surface is more easily trapped in bad local minima, which occurs mostly in low-bit quantization. A deeper analysis indicates that the rugged surface is caused by the injection of excessive quantization noise. We therefore separate out a sharpness term from the loss that reflects the impact of quantization noise. To smooth the rugged loss surface, we propose keeping the sharpness term small and stable during optimization. Instead of directly optimizing the network at the target bit-width, we design a self-adapted shrinking scheduler that reduces the bit-width in the continuous domain from a high bit-width to the target, limiting the increase of the sharpness term to a proper range. This can be viewed as iteratively adding small “instant” quantization noise and adjusting the network to eliminate its impact. Extensive experiments on classification and detection tasks demonstrate the effectiveness of the Bit-shrinking strategy in PTQ. On Vision Transformer models, our INT8 and INT6 models drop within 0.5% and 1.5% Top-1 accuracy, respectively. On traditional CNNs, our INT4 quantized models drop within 1.3% and 3.5% Top-1 accuracy on ResNet18 and MobileNetV2 without fine-tuning, achieving state-of-the-art performance.
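
The scheduling idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not define the sharpness term, so the code below assumes the mean squared quantization error of a weight tensor as a stand-in proxy, and the names `quantize`, `quant_noise`, `shrink_schedule`, `noise_budget`, and `min_step` are hypothetical. It shows only how a bit-width could be lowered in the continuous domain while capping the noise added per step; the network adjustment between steps is left as a comment.

```python
import numpy as np


def quantize(w, bits):
    """Symmetric uniform fake-quantization; `bits` may be fractional (> 1)."""
    max_abs = np.max(np.abs(w))
    if max_abs == 0:
        return w.copy()
    qmax = 2.0 ** (bits - 1.0) - 1.0      # continuous in `bits`
    step = max_abs / qmax
    q = np.clip(np.round(w / step), -qmax, qmax)
    return q * step


def quant_noise(w, bits):
    """Assumed proxy for the sharpness term: mean squared quantization error."""
    return float(np.mean((w - quantize(w, bits)) ** 2))


def shrink_schedule(w, start_bits=8.0, target_bits=4.0, max_step=1.0,
                    noise_budget=1e-4, min_step=1e-3):
    """Lower the bit-width toward the target, limiting added noise per step."""
    bits = start_bits
    schedule = [bits]
    while bits > target_bits:
        remaining = bits - target_bits
        step = min(max_step, remaining)
        current = quant_noise(w, bits)
        # Halve the step until the extra noise fits the budget
        # (or the step becomes too small to shrink further).
        while (step > min_step
               and quant_noise(w, bits - step) - current > noise_budget):
            step /= 2.0
        bits = target_bits if step >= remaining else bits - step
        schedule.append(bits)
        # A real PTQ pipeline would adjust weights or quantizer
        # parameters here before shrinking again (omitted in this sketch).
    return schedule


rng = np.random.default_rng(0)
weights = rng.normal(size=1000)
schedule = shrink_schedule(weights)
print(f"{len(schedule)} steps from {schedule[0]} to {schedule[-1]} bits")
```

Because quantization error grows quickly as the bit-width falls, a fixed noise budget tends to produce smaller steps near the target, which matches the abstract's description of adding small amounts of noise at a time.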

HitNet: Hybrid Ternary Recurrent Neural Network

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

Optimizing Quantized Neural Networks in a Weak Curvature Manifold

Propagating Asymptotic-Estimated Gradients for Low Bitwidth Quantized Neural Networks

Alternating Multi-bit Quantization for Recurrent Neural Networks

Simultaneously Optimizing Weight and Quantizer of Ternary Neural Network using Truncated Gaussian Approximation

Effective Quantization Methods for Recurrent Neural Networks

Residual Quantization for Low Bit-Width Neural Networks

Pruning Ternary Quantization

Ternary Quantization: A Survey

Quantization Networks

Weighted-Entropy-Based Quantization for Deep Neural Networks

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

Bit-shrinking: Limiting Instantaneous Sharpness for Improving Post-training Quantization

Balanced Quantization: An Effective and Efficient Approach to Quantized Neural Networks

Adaptive Binary-Ternary Quantization

LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks

HAWQV3: Dyadic Neural Network Quantization

Bag of Tricks with Quantized Convolutional Neural Networks for image classification

FATNN: Fast and Accurate Ternary Neural Networks