Abstract: Efficient model inference is an important practical issue when deploying deep neural networks on resource-constrained platforms. Network quantization addresses this problem effectively by using low-bit representations and arithmetic that can be executed on dedicated embedded systems. In previous works, the parameter bitwidth is set homogeneously across layers, which forces a trade-off between high accuracy and aggressive compression. However, the stacked network layers, which are generally regarded as hierarchical feature extractors, contribute unequally to the overall performance. For a well-trained neural network, the feature distributions of different categories separate gradually as the network propagates forward, so the capability required of the subsequent feature extractors is reduced. This indicates that the neurons in posterior layers can be assigned lower bitwidths in quantized neural networks. Based on this observation, a simple but effective mixed-precision quantized neural network with progressively decreasing bitwidth is proposed to improve the trade-off between accuracy and compression. Extensive experiments on typical network architectures and benchmark datasets demonstrate that the proposed method achieves better or comparable results while reducing the memory space for quantized parameters by more than 30\% in comparison with the homogeneous counterparts. In addition, the results demonstrate that higher-precision bottom layers boost 1-bit network performance appreciably because they better preserve the original image information, while lower-precision posterior layers contribute to the regularization of $k$-bit networks.
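
To make the idea of a progressively decreasing bitwidth concrete, the following minimal sketch assigns a bitwidth to each layer and compares the parameter storage with a homogeneous assignment. It is an illustration under stated assumptions, not the authors' implementation: the linear schedule, the function names, the layer sizes, and the 8-to-2-bit range are all hypothetical, and the resulting numbers are not results from the paper.

```python
def decreasing_bitwidth_schedule(num_layers, first_bits, last_bits):
    """Return one bitwidth per layer, decreasing linearly from first to last.

    Hypothetical helper: the abstract only states that bitwidth decreases
    progressively; a linear schedule is an assumption made for illustration.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if first_bits < last_bits:
        raise ValueError("first_bits must be >= last_bits")
    if num_layers == 1:
        return [first_bits]
    step = (first_bits - last_bits) / (num_layers - 1)
    return [round(first_bits - i * step) for i in range(num_layers)]


def parameter_storage_bits(param_counts, bitwidths):
    """Total storage in bits when layer i stores param_counts[i] weights
    at bitwidths[i] bits each."""
    if len(param_counts) != len(bitwidths):
        raise ValueError("need exactly one bitwidth per layer")
    return sum(n * b for n, b in zip(param_counts, bitwidths))


# Hypothetical layer sizes (parameters per layer), growing with depth.
param_counts = [1_000, 4_000, 16_000, 64_000]

# Mixed precision: early layers keep more bits, later layers fewer.
mixed_bits = decreasing_bitwidth_schedule(len(param_counts), 8, 2)

# Homogeneous baseline: every layer uses the same bitwidth (assumed 4 here).
uniform_bits = [4] * len(param_counts)

mixed_total = parameter_storage_bits(param_counts, mixed_bits)
uniform_total = parameter_storage_bits(param_counts, uniform_bits)
saving = 1.0 - mixed_total / uniform_total

print("per-layer bitwidths:", mixed_bits)
print("mixed storage (bits):", mixed_total)
print("uniform storage (bits):", uniform_total)
print(f"relative saving: {saving:.1%}")
```

Because later layers in this toy configuration hold most of the parameters, lowering their bitwidth dominates the total storage, while the early layers keep higher precision at little cost. The actual saving depends entirely on the architecture and the chosen schedule.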

DyBit: Dynamic Bit-Precision Numbers for Efficient Quantized Neural Network Inference

Bit Efficient Quantization for Deep Neural Networks

Training High-Performance and Large-Scale Deep Neural Networks with Full 8-Bit Integers

4.6-Bit Quantization for Fast and Accurate Neural Network Inference on CPUs

Mixed-Precision Quantized Neural Network with Progressively Decreasing Bitwidth For Image Classification and Object Detection

Instance-Aware Dynamic Neural Network Quantization

VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference

Block-Wise Dynamic-Precision Neural Network Training Acceleration via Online Quantization Sensitivity Analytics

Differentiable Soft Quantization: Bridging Full-Precision and Low-Bit Neural Networks

BitQ: Tailoring Block Floating Point Precision for Improved DNN Efficiency on Resource-Constrained Devices

Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation

Structured Dynamic Precision for Deep Neural Networks Quantization

A Novel Low-Bit Quantization Strategy for Compressing Deep Neural Networks

Accelerating Neural Network Inference by Overflow Aware Quantization

Mixed-precision Deep Neural Network Quantization With Multiple Compression Rates

Towards Low-Bit Quantization of Deep Neural Networks with Limited Data

HAWQV3: Dyadic Neural Network Quantization

Adaptive Quantization for Deep Neural Network

Multi-Precision Quantized Neural Networks Via Encoding Decomposition of {-1, +1}

DP-Nets: Dynamic Programming Assisted Quantization Schemes for DNN Compression and Acceleration

DRQ: Dynamic Region-based Quantization for Deep Neural Network Acceleration