Improving Neural Network Efficiency Via Post-training Quantization with Adaptive Floating-Point

Fangxin Liu, Wenbo Zhao, Zhezhi He, Yanzhi Wang, Zongwu Wang, Changzhi Dai, Xiaoyao Liang, Li Jiang
DOI: https://doi.org/10.1109/iccv48922.2021.00523
2021-01-01
Abstract: Model quantization has emerged as a mandatory technique for efficient inference with advanced Deep Neural Networks (DNNs), representing model parameters with fewer bits. Nevertheless, prior model quantization methods either suffer from inefficient data encoding, which leads to noncompetitive model compression rates, or require a time-consuming quantization-aware training process. In this work, we propose a novel Adaptive Floating-Point (AFP) format, a variant of the standard IEEE-754 floating-point format with flexible configuration of the exponent and mantissa segments. Leveraging AFP for model quantization (i.e., encoding the parameters) can significantly enhance the model compression rate without accuracy degradation or model re-training. AFP also effectively eliminates the computationally intensive de-quantization step of the dynamic quantization technique adopted by popular machine learning frameworks (e.g., PyTorch, TensorRT). Moreover, we develop a framework that automatically optimizes and chooses an adequate AFP configuration for each layer, thus maximizing compression efficacy. Our experiments indicate that AFP-encoded ResNet-50/MobileNet-v2 has only ∼0.04%/0.6% accuracy degradation w.r.t. its full-precision counterpart. It outperforms state-of-the-art works by 1.1% in accuracy at the same bit-width while reducing energy consumption by 11.2×. Code is released at: https://github.com/MXHX7199/ICCV_2021_AFP
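To make the idea of a floating-point format with configurable exponent and mantissa widths concrete, the following is a minimal Python sketch. It is not the authors' implementation: the function names `quantize_to_float_format` and `best_split`, the IEEE-style default bias, the choice not to reserve codes for infinity or NaN, and the brute-force mean-squared-error search are all assumptions made for illustration only.

```python
import math

def quantize_to_float_format(x, exp_bits, man_bits, bias=None):
    """Round x to the nearest value of a small sign/exponent/mantissa format.

    Hypothetical helper: assumes an IEEE-style bias, subnormal support,
    and no exponent codes reserved for inf/NaN.
    """
    if bias is None:
        bias = 2 ** (exp_bits - 1) - 1           # IEEE-style default (assumption)
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    e_min = 1 - bias                             # smallest normal exponent
    e_max = (2 ** exp_bits - 1) - bias           # largest exponent
    _, ex = math.frexp(mag)                      # mag = m * 2**ex, 0.5 <= m < 1
    e = max(ex - 1, e_min)                       # below e_min: subnormal spacing
    step = 2.0 ** (e - man_bits)                 # gap between neighbouring values
    q = round(mag / step) * step                 # round to nearest representable
    max_val = (2.0 - 2.0 ** (-man_bits)) * 2.0 ** e_max
    return sign * min(q, max_val)                # saturate on overflow

def best_split(weights, total_bits):
    """Brute-force the exponent/mantissa split (1 sign bit) with lowest MSE.

    A naive stand-in for per-layer configuration search, not the paper's framework.
    """
    best = None
    for exp_bits in range(1, total_bits - 1):
        man_bits = total_bits - 1 - exp_bits
        mse = sum((w - quantize_to_float_format(w, exp_bits, man_bits)) ** 2
                  for w in weights) / len(weights)
        if best is None or mse < best[0]:
            best = (mse, exp_bits, man_bits)
    return best[1], best[2]

# Example: one "layer" of made-up weights, 8 bits per weight.
layer = [0.3, -0.02, 0.11, -0.45, 0.007]
print(best_split(layer, 8))
print(quantize_to_float_format(0.3, exp_bits=4, man_bits=3))  # 0.3125
```

Under these assumptions, the split that minimizes error depends on the spread of the weights: a layer with a wide dynamic range tends to favor more exponent bits, while a layer with tightly clustered values tends to favor more mantissa bits, which is the intuition behind choosing a configuration per layer.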