AE-Qdrop: Towards Accurate and Efficient Low-Bit Post-Training Quantization for A Convolutional Neural Network

Jixing Li, Gang Chen, Min Jin, Wenyu Mao, Huaxiang Lu
DOI: https://doi.org/10.3390/electronics13030644
IF: 2.9
2024-02-05
Electronics
Abstract: Block-wise reconstruction with adaptive rounding helps achieve acceptable 4-bit post-training quantization accuracy. However, adaptive rounding is time intensive, and the optimization space of weight elements is constrained to a binary set, thus limiting the performance of quantized models. The optimality of block-wise reconstruction requires that subsequent network blocks remain unquantized. To address this, we propose a two-stage post-training quantization scheme, AE-Qdrop, encompassing block-wise reconstruction and global fine-tuning. In the block-wise reconstruction stage, a progressive optimization strategy is introduced as a replacement for adaptive rounding, enhancing both quantization accuracy and efficiency. Additionally, the integration of randomly weighted quantized activation helps mitigate the risk of overfitting. In the global fine-tuning stage, the weights of each quantized network block are corrected simultaneously through logit matching and feature matching. Experiments in image classification and object detection tasks validate that AE-Qdrop achieves high precision and efficient quantization. For the 2-bit MobileNetV2, AE-Qdrop outperforms Qdrop in quantization accuracy by 6.26%, and its quantization efficiency is fivefold higher.
Engineering, Electrical & Electronic; Computer Science, Information Systems; Physics, Applied
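The abstract notes that adaptive rounding confines each weight to a binary set. The sketch below illustrates that point for a single weight. It assumes the published AdaRound parameterization: a rectified sigmoid with stretch parameters zeta = 1.1 and gamma = -0.1, together with the regularizer 1 - |2h - 1|^beta. It is not AE-Qdrop's code, and all function names are hypothetical. However the continuous variable `v` is optimized, the hardened weight can only land on floor(w/s) or floor(w/s) + 1 grid steps.

```python
import math

# Stretch parameters of AdaRound's rectified sigmoid (zeta = 1.1, gamma = -0.1).
ZETA, GAMMA = 1.1, -0.1


def qrange(n_bits):
    """Integer grid [qmin, qmax] for signed, symmetric n-bit quantization."""
    return -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1


def round_to_nearest(w, scale, n_bits):
    """Baseline fake quantization: round half up, clamp, rescale."""
    qmin, qmax = qrange(n_bits)
    q = min(max(math.floor(w / scale + 0.5), qmin), qmax)
    return q * scale


def rectified_sigmoid(v):
    """Soft rounding offset h(v), clipped to [0, 1]."""
    s = 1.0 / (1.0 + math.exp(-v))
    return min(max(s * (ZETA - GAMMA) + GAMMA, 0.0), 1.0)


def rounding_regularizer(v, beta):
    """1 - |2h - 1|**beta: zero only when h(v) is exactly 0 or 1.
    AdaRound anneals beta over its optimization schedule."""
    return 1.0 - abs(2.0 * rectified_sigmoid(v) - 1.0) ** beta


def adaround(w, scale, v, n_bits, hard=False):
    """Quantized weight scale * clamp(floor(w / scale) + h).
    With hard=True, h is 1 if v >= 0 (i.e. h(v) >= 0.5) else 0: a binary choice."""
    qmin, qmax = qrange(n_bits)
    if hard:
        h = 1.0 if v >= 0 else 0.0
    else:
        h = rectified_sigmoid(v)
    q = min(max(math.floor(w / scale) + h, qmin), qmax)
    return q * scale


if __name__ == "__main__":
    w, scale = 0.37, 0.25  # w / scale = 1.48
    print(round_to_nearest(w, scale, n_bits=4))  # 0.25
    # Whatever value v is optimized to, the hardened weight is floor or ceil.
    print(sorted({adaround(w, scale, v, 4, hard=True) for v in (-5.0, -0.1, 0.3, 5.0)}))
    # -> [0.25, 0.5]
```

Because only the rounding direction is learned, a weight can never move more than one grid step away from its floor value. This is the constraint that a progressive optimization strategy is meant to relax.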
What problem does this paper attempt to address?
This paper addresses the accuracy loss and inefficiency encountered in low-bit quantization of convolutional neural networks (CNNs). Specifically, it points out that existing techniques such as Adaptive Rounding can improve quantization accuracy, but their optimization space is limited to a binary set. They also require a large number of iterations to gradually weaken the regularization constraints, which significantly reduces quantization efficiency. In addition, Block-wise Reconstruction assumes that subsequent network blocks are not quantized, whereas in practice all network blocks are quantized, which may lead to sub-optimal results for each block.
To overcome these problems, the paper proposes AE-Qdrop, a two-stage post-training quantization scheme consisting of block-wise reconstruction and global fine-tuning. In the block-wise reconstruction stage, a Progressive Optimization Strategy (POS) replaces Adaptive Rounding, improving both quantization accuracy and efficiency. Randomly Weighted Quantized Activation (RWQA) increases the diversity of activations, which improves the generalization of the quantized model. In the global fine-tuning stage, the weights of all quantized network blocks are corrected simultaneously through feature matching and logit matching, further improving quantization accuracy.
The main contributions of the paper are as follows:
1. A theoretical analysis of the limitations of Adaptive Rounding and Block-wise Reconstruction.
2. The AE-Qdrop algorithm, which combines POS and RWQA to improve the accuracy and efficiency of block-wise reconstruction, then further optimizes overall quantization accuracy through global fine-tuning of the weights.
3. Extensive experiments on mainstream networks that verify the quantization accuracy and efficiency of AE-Qdrop. For example, for 2-bit MobileNetV2, the accuracy of AE-Qdrop is 6.26% higher than that of Qdrop, and its quantization efficiency is five times that of Qdrop.
These improvements make AE-Qdrop an efficient, high-accuracy low-bit quantization method suitable for resource-constrained mobile devices.
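The two components that follow block reconstruction, randomly weighted quantized activation and global fine-tuning by logit and feature matching, can be sketched in plain Python. This is a hedged illustration, not the paper's implementation. `qdrop_mix` follows QDrop's published idea of randomly keeping full-precision activations during reconstruction. `random_weighted_mix` is only an assumed reading of RWQA, modeled as a per-element random blend of quantized and full-precision values, and is not the paper's definition. The loss uses KL divergence for logit matching and mean squared error for feature matching. The temperature `t` and weight `alpha` are hypothetical hyperparameters.

```python
import math
import random


def qdrop_mix(x_fp, x_q, p=0.5, rng=random):
    """QDrop-style mixing: each element keeps its full-precision value
    with probability p, otherwise uses the quantized value."""
    return [f if rng.random() < p else q for f, q in zip(x_fp, x_q)]


def random_weighted_mix(x_fp, x_q, rng=random):
    """HYPOTHETICAL reading of RWQA (assumption, not the paper's definition):
    blend each quantized and full-precision element with a random weight."""
    mixed = []
    for f, q in zip(x_fp, x_q):
        lam = rng.random()  # weight on the quantized activation, in [0, 1)
        mixed.append(lam * q + (1.0 - lam) * f)
    return mixed


def softmax(z, t=1.0):
    """Numerically stable softmax with temperature t."""
    m = max(z)
    e = [math.exp((v - m) / t) for v in z]
    total = sum(e)
    return [v / total for v in e]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q must be strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def global_finetune_loss(fp_logits, q_logits, fp_feats, q_feats, t=1.0, alpha=1.0):
    """Logit matching (KL from the full-precision model's softened outputs)
    plus alpha times feature matching (mean MSE over paired feature vectors).
    t and alpha are illustrative hyperparameters."""
    logit_term = kl_divergence(softmax(fp_logits, t), softmax(q_logits, t))
    feat_term = sum(mse(a, b) for a, b in zip(fp_feats, q_feats)) / len(fp_feats)
    return logit_term + alpha * feat_term


if __name__ == "__main__":
    rng = random.Random(0)
    x_fp, x_q = [0.12, -0.80, 1.53], [0.00, -0.75, 1.50]
    print(qdrop_mix(x_fp, x_q, p=0.5, rng=rng))
    print(random_weighted_mix(x_fp, x_q, rng=rng))
    print(global_finetune_loss([2.0, 0.5, -1.0], [1.8, 0.7, -0.9],
                               [[0.1, 0.2]], [[0.1, 0.25]]))
```

In a real pipeline these functions would operate on tensors inside the reconstruction and fine-tuning loops. Plain lists keep the sketch dependency-free. Note that the loss is zero only when the quantized network reproduces both the full-precision logits and the matched features.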