Abstract: 8-bit quantization has been widely applied to accelerate network inference in various deep learning applications. There are two kinds of quantization methods: training-based quantization and post-training quantization. The training-based approach suffers from a cumbersome training process, while post-training quantization may lead to an unacceptable accuracy drop. In this paper, we present an efficient and simple post-training method via scale optimization, named EasyQuant (EQ), that obtains accuracy comparable to the training-based method. Specifically, we first alternately optimize the scales of weights and activations for all layers, targeting the convolutional outputs, to obtain high quantization precision. Then, we lower the bit width to INT7 for both weights and activations, and adopt INT16 intermediate storage and an integer Winograd convolution implementation to accelerate inference. Experimental results on various computer vision tasks show that EQ outperforms the TensorRT method and can achieve near-INT8 accuracy at 7-bit width post-training.
Computer Vision and Pattern Recognition, Machine Learning, Image and Video Processing
What problem does this paper attempt to address?
The paper addresses how to use post-training quantization to preserve model accuracy while reducing memory footprint and computational load when deep learning models are deployed to devices with limited computing resources. Specifically, it proposes a new post-training quantization method, EasyQuant (EQ), which optimizes the scale factors of weights and activations to reduce the precision loss introduced by quantization. EasyQuant achieves accuracy close to that of training-based quantization methods without additional training, and at 7-bit width it reaches accuracy near that of INT8.
### Main contributions:
1. **A scale optimization method for post-training quantization**: By alternately searching for the scale factors of weights and activations, the method maximizes the cosine similarity between the original floating-point output and the quantized output, obtaining accuracy comparable to that of training-based quantization methods.
2. **A more efficient INT7 quantized inference framework**: By adopting INT16 intermediate storage and an integer Winograd convolution implementation, inference speed is increased, which is especially suitable for the ARM platform.
3. **Extensive experimental verification**: Experiments on multiple computer vision tasks, including classification, detection, and recognition, demonstrate the effectiveness of EasyQuant under different bit-width settings.
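
One way to see why the second contribution pairs INT7 with INT16 intermediate storage is to count how many worst-case products fit in a signed 16-bit accumulator. The sketch below is a back-of-the-envelope check, not the paper's implementation: the function name `max_safe_accumulations` is hypothetical, and it assumes symmetric clipping ranges ([-127, 127] for INT8 and [-63, 63] for INT7).

```python
def max_safe_accumulations(bits, acc_bits=16):
    """Count worst-case products that can be summed in a signed accumulator.

    Assumption: operands are clipped to the symmetric range
    [-(2**(bits-1) - 1), 2**(bits-1) - 1].
    """
    qmax = 2 ** (bits - 1) - 1          # largest operand magnitude
    worst_product = qmax * qmax         # largest magnitude of one product
    acc_max = 2 ** (acc_bits - 1) - 1   # largest positive accumulator value
    return acc_max // worst_product


for bits in (8, 7):
    print(f"INT{bits}: {max_safe_accumulations(bits)} products fit in INT16")
```

Under these assumptions, an INT16 accumulator can hold eight worst-case INT7 products but only two INT8 products, so the sum needs to be widened to 32 bits less often at 7 bits.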
### Method overview:
- **Linear quantization formula**: Defines the linear quantization process \( Q(X, S) \), where \( X \) is a tensor and \( S \) is a scale factor; the quantized result \( Q(X, S) \) lies in the \( b \)-bit integer domain \( Z_b \).
- **Scale optimization**: Optimizes the scale factors by maximizing the cosine similarity between the floating-point and quantized output feature maps of each layer, alternating between the scale factors of weights and those of activations.
- **INT7 quantization inference**: By using ARM instructions such as SMLAL and SADALP, 7-bit quantization achieves higher computational efficiency and fewer memory accesses.
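
The first two bullets can be illustrated with a minimal pure-Python sketch. It is not the authors' code and makes several assumptions: quantization is round-then-clip to a symmetric range, a plain matrix-vector product stands in for the convolution, scales start from max-abs calibration, and the search is a simple grid over a hypothetical interval of 0.5x to 2x the current scale. All function names are illustrative.

```python
import math
import random


def quantize(xs, scale, bits=8):
    """Q(X, S): multiply by the scale, round, clip to a symmetric b-bit range."""
    qmax = 2 ** (bits - 1) - 1  # assumption: symmetric range, e.g. [-127, 127]
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return [max(-qmax, min(qmax, round(x * scale))) for x in xs]


def layer_output(acts, weights):
    """Toy stand-in for a convolution: one dot product per weight row."""
    return [sum(a * w for a, w in zip(acts, row)) for row in weights]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def quantized_output(acts, weights, s_a, s_w, bits=8):
    """Integer layer output, de-quantized by dividing out both scales."""
    q_a = quantize(acts, s_a, bits)
    q_w = [quantize(row, s_w, bits) for row in weights]
    return [o / (s_a * s_w) for o in layer_output(q_a, q_w)]


def search_scale(score, init, steps=100, lo=0.5, hi=2.0):
    """Grid search over [lo*init, hi*init]; the interval is an assumption."""
    candidates = [init * (lo + (hi - lo) * i / (steps - 1)) for i in range(steps)]
    candidates.append(init)  # keep the current scale so the score never drops
    return max(candidates, key=score)


def alternate_scale_search(acts, weights, bits=8, rounds=3):
    """Alternately tune the weight scale and the activation scale."""
    reference = layer_output(acts, weights)  # floating-point output
    qmax = 2 ** (bits - 1) - 1
    # Assumed starting point: map the largest magnitude to qmax.
    s_a = qmax / max(abs(a) for a in acts)
    s_w = qmax / max(abs(w) for row in weights for w in row)
    for _ in range(rounds):
        # Fix the activation scale, search the weight scale.
        s_w = search_scale(
            lambda s: cosine(reference, quantized_output(acts, weights, s_a, s, bits)),
            s_w)
        # Fix the weight scale, search the activation scale.
        s_a = search_scale(
            lambda s: cosine(reference, quantized_output(acts, weights, s, s_w, bits)),
            s_a)
    return s_a, s_w


def toy_layer(seed=0, n_in=32, n_out=8):
    """Random activations and weights for trying the search (made-up data)."""
    rng = random.Random(seed)
    acts = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return acts, weights
```

Calling `alternate_scale_search(*toy_layer(), bits=7)` returns a pair of scales for the toy layer. Because each grid includes the current scale, the cosine similarity cannot decrease from one step to the next, which is what makes the alternating scheme safe to iterate.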
### Experimental results:
- **INT8 quantization**: On ImageNet2012 classification, VOC2007 object detection, and face recognition tasks, EasyQuant outperforms the TensorRT method across multiple convolutional neural network architectures.
- **INT7 quantization**: At 7-bit width, EasyQuant maintains accuracy close to INT8 and also shows lower inference latency on actual hardware (such as the RK3399).
In conclusion, the paper addresses the efficient deployment of deep learning models on devices with limited computing resources by proposing EasyQuant, a post-training quantization method that approaches the accuracy of training-based quantization without retraining.