Abstract: The 8 bits quantization has been widely applied to accelerate network inference in various deep learning applications. There are two kinds of quantization methods, training-based quantization and post-training quantization. Training-based approach suffers from a cumbersome training process, while post-training quantization may lead to unacceptable accuracy drop. In this paper, we present an efficient and simple post-training method via scale optimization, named EasyQuant (EQ), that could obtain comparable accuracy with the training-based method. Specifically, we first alternately optimize scales of weights and activations for all layers target at convolutional outputs to further obtain the high quantization precision. Then, we lower down bit width to INT7 both for weights and activations, and adopt INT16 intermediate storage and integer Winograd convolution implementation to accelerate inference. Experimental results on various computer vision tasks show that EQ outperforms the TensorRT method and can achieve near INT8 accuracy in 7 bits width post-training.
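
A quick arithmetic check suggests why pairing INT7 operands with INT16 intermediate storage can help. The sketch below assumes symmetric signed ranges (±63 for INT7, ±127 for INT8), which the abstract does not specify, and the helper name `max_safe_accumulations` is purely illustrative. It counts how many worst-case products can be summed in a signed 16-bit accumulator before overflow becomes possible.

```python
def max_safe_accumulations(bits: int, acc_bits: int = 16) -> int:
    """Count worst-case products of two signed `bits`-wide values that fit
    in a signed `acc_bits`-wide accumulator.

    Assumption: operands use the symmetric range
    [-(2**(bits-1) - 1), 2**(bits-1) - 1]; the paper's exact range may differ.
    """
    q_max = 2 ** (bits - 1) - 1        # 63 for INT7, 127 for INT8
    worst_product = q_max * q_max      # largest magnitude of a single product
    acc_max = 2 ** (acc_bits - 1) - 1  # 32767 for INT16
    return acc_max // worst_product


if __name__ == "__main__":
    for bits in (7, 8):
        n = max_safe_accumulations(bits)
        print(f"INT{bits}: {n} worst-case products per INT16 accumulator")
```

Under these assumptions, INT7 leaves room for eight worst-case products per INT16 accumulator versus two for INT8, so a kernel would need to widen to a larger accumulator less often.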

What problem does this paper attempt to address?

This paper addresses how to use post-training quantization to preserve model accuracy while reducing memory footprint and computational load when deep learning models are deployed on devices with limited computing resources. Specifically, it proposes EasyQuant (EQ), a post-training quantization method that optimizes the scale factors of weights and activations to reduce the accuracy loss introduced by quantization. Unlike training-based quantization, EasyQuant needs no additional training, yet it reaches accuracy close to that of training-based methods and stays near INT8 accuracy at 7-bit width.

### Main contributions:

1. **A scale optimization method for post-training quantization**: By alternately searching for the scale factors of weights and activations, the method maximizes the cosine similarity between the original floating-point output and the quantized output, obtaining accuracy comparable to that of training-based quantization.
2. **A more efficient INT7 quantized inference framework**: INT16 intermediate storage and an integer Winograd convolution implementation increase inference speed, which is especially useful on ARM platforms.
3. **Extensive experimental verification**: Experiments on multiple computer vision tasks, including classification, detection, and recognition, demonstrate the effectiveness of EasyQuant under different bit-width settings.

### Method overview:

- **Linear quantization formula**: The paper defines the linear quantization process \( Q(X, S) \), where \( X \) is a tensor and \( S \) is a scale factor. The result \( Q(X, S) \) lies in the \( b \)-bit integer domain \( Z_b \).
- **Scale optimization**: The scale factors are chosen to maximize the cosine similarity between the floating-point and quantized output feature maps of each layer. An alternating scheme adjusts the weight scale while the activation scale is held fixed, and then the reverse.
- **INT7 quantized inference**: Using ARM-specific instructions (such as SMLAL and SADALP), 7-bit quantization achieves higher computational efficiency and fewer memory accesses.

### Experimental results:

- **INT8 quantization**: On ImageNet2012 classification, VOC2007 object detection, and face recognition, EasyQuant outperforms the TensorRT method across multiple convolutional neural network architectures.
- **INT7 quantization**: At 7-bit width, EasyQuant maintains high accuracy and also shows lower inference latency on real hardware (such as the RK3399).

In conclusion, the paper tackles efficient deployment of deep learning models on resource-constrained devices by proposing EasyQuant, which narrows the accuracy gap between post-training and training-based quantization.
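
The alternating scale search from the method overview can be sketched in NumPy for a single layer. This is a minimal illustration under stated assumptions, not the authors' implementation: a matrix product stands in for the convolution, \( Q(X, S) \) is taken to be clip(round(X · S)) over a symmetric range, the initial scales come from the maximum absolute value, and the search grid (`ratios`) and all function names are hypothetical.

```python
import numpy as np


def quantize(x, scale, bits):
    """Assumed linear quantizer: scale, round, clip to a symmetric b-bit range.
    Values are integers stored in a float array for simplicity."""
    q_max = 2 ** (bits - 1) - 1
    return np.clip(np.round(x * scale), -q_max, q_max)


def cosine(a, b):
    """Cosine similarity between two flattened tensors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def quantized_output(x, w, s_x, s_w, bits):
    """Quantized matmul, de-quantized by dividing out both scales."""
    return quantize(x, s_x, bits) @ quantize(w, s_w, bits) / (s_x * s_w)


def search_scale(x, w, s_x, s_w, bits, target, which, ratios):
    """Grid-search one scale while the other stays fixed; keep the current
    value unless a candidate gives strictly higher cosine similarity."""
    base = s_x if which == "x" else s_w
    best = base
    best_sim = cosine(target, quantized_output(x, w, s_x, s_w, bits))
    for r in ratios:
        cand = base * r
        cx, cw = (cand, s_w) if which == "x" else (s_x, cand)
        sim = cosine(target, quantized_output(x, w, cx, cw, bits))
        if sim > best_sim:
            best, best_sim = cand, sim
    return best


def optimize_scales(x, w, bits=7, rounds=3):
    """Alternate between the weight scale and the activation scale."""
    q_max = 2 ** (bits - 1) - 1
    s_x = q_max / np.abs(x).max()       # assumed max-abs initialization
    s_w = q_max / np.abs(w).max()
    target = x @ w                      # floating-point reference output
    ratios = np.linspace(0.5, 2.0, 61)  # hypothetical search grid
    for _ in range(rounds):
        s_w = search_scale(x, w, s_x, s_w, bits, target, "w", ratios)
        s_x = search_scale(x, w, s_x, s_w, bits, target, "x", ratios)
    return s_x, s_w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 16))   # stand-in activations
    w = rng.standard_normal((16, 8))    # stand-in weights
    s_x, s_w = optimize_scales(x, w, bits=7)
    sim = cosine(x @ w, quantized_output(x, w, s_x, s_w, 7))
    print(f"cosine similarity after search: {sim:.6f}")
```

Because each search step keeps the current scale unless a candidate is strictly better, the similarity never decreases from one step to the next. A real pipeline would evaluate it on calibration data passed through the actual convolution layers.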

EasyQuant: Post-training Quantization via Scale Optimization

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

PTQ-SO: A Scale Optimization-based Approach for Post-training Quantization of Edge Computing

Bit-shrinking: Limiting Instantaneous Sharpness for Improving Post-training Quantization

Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective

Towards Accurate and Efficient Sub-8-Bit Integer Training

Low-precision CNN Model Quantization based on Optimal Scaling Factor Estimation

COMQ: A Backpropagation-Free Algorithm for Post-Training Quantization

Designing Quantizers for Low-Precision Post-Training Quantization: A Standard Pipeline Approach for CNNs

Optimization-based Post-training Quantization with Bit-split and Stitching

Two-Step Quantization for Low-bit Neural Networks

2DQuant: Low-bit Post-Training Quantization for Image Super-Resolution

Error-aware Quantization through Noise Tempering

Post-Training Non-Uniform Quantization for Convolutional Neural Networks

decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points

Training High-Performance and Large-Scale Deep Neural Networks with Full 8-Bit Integers

Post-training Quantization or Quantization-aware Training? That is the Question

MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search

PD-Quant: Post-Training Quantization Based on Prediction Difference Metric

Gradient Distribution-aware INT8 Training for Neural Networks