Post-training Quantization or Quantization-aware Training? That is the Question

Xinfei Guo, Xiaotian Zhao, Ruge Xu
DOI: https://doi.org/10.1109/CSTIC58779.2023.10219214
2023-06-26
Abstract: Quantization has been demonstrated to be one of the most effective model compression solutions, and it can potentially be adapted to support large models on a resource-constrained edge device while maintaining a minimal power budget. There are two forms of quantization: post-training quantization (PTQ) and quantization-aware training (QAT). The former starts from a model trained with floating-point computation and quantizes it afterward, while the latter compensates for quantization-related errors by using the quantized version of the network in the forward pass during training. Though QAT can produce accuracy benefits, it suffers from a long training process and less flexibility during deployment. Traditionally, researchers make a one-time decision between QAT and PTQ depending on the quantized bit-width and hardware requirements. In this work, we observe that even though the hardware cost is approximately the same for various quantization schemes, the sensitivity to training differs for each quantized layer. As a result, certain schemes require QAT more than others. We argue that it is necessary to look into this dimension by measuring the accuracy difference for each layer under QAT and PTQ conditions. In this paper, we introduce a methodology that provides a systematic and explainable way to quantify the tradeoffs between the two quantization forms. This is especially beneficial for evaluating a layer-wise mixed-precision quantization (MPQ) scheme, where different bit-widths are allowed across layers and the search space is enormous.
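To make the layer-wise view concrete, the following is a minimal sketch of the PTQ side only, and it is not the paper's methodology. It assumes symmetric, per-tensor uniform quantization of weights, a toy ReLU MLP with random weights, and output mean-squared error as a stand-in for accuracy loss. The names `fake_quantize`, `forward`, and `layer_sensitivity` are hypothetical helpers introduced for illustration. Comparing against QAT would additionally require retraining with the quantized forward pass, which is omitted here.

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Quantize then dequantize x (symmetric, per-tensor scale; an assumption)."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return x.copy()
    scale = max_abs / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def forward(weights, x):
    """Toy MLP: ReLU after every layer except the last."""
    for i, w in enumerate(weights):
        x = x @ w
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

def layer_sensitivity(weights, x, num_bits):
    """Output MSE when exactly one layer's weights are quantized at a time."""
    ref = forward(weights, x)
    errors = []
    for i in range(len(weights)):
        perturbed = list(weights)
        perturbed[i] = fake_quantize(weights[i], num_bits)
        out = forward(perturbed, x)
        errors.append(float(np.mean((out - ref) ** 2)))
    return errors

# Random stand-in model and inputs (illustrative data, not from the paper).
rng = np.random.default_rng(0)
weights = [
    rng.normal(size=(16, 32)),
    rng.normal(size=(32, 32)),
    rng.normal(size=(32, 4)),
]
x = rng.normal(size=(64, 16))

for bits in (8, 4):
    print(bits, layer_sensitivity(weights, x, bits))
```

Layers whose error grows fastest as the bit-width shrinks are natural candidates for either a higher bit-width or QAT. In a mixed-precision search, a per-layer measurement of this kind could be used to prune the space before any retraining is attempted.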
Computer Science