Neural Networks Integer Computation: Quantizing Convolutional Neural Networks of Inference and Training for Object Detection in Embedded Systems
Penghao Xiao, Chunjie Zhang, Qian Guo, Xiayang Xiao, Haipeng Wang
DOI: https://doi.org/10.1109/jstars.2024.3452321
IF: 4.715
2024-01-01
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Abstract: Deep neural networks (DNNs) have evolved to be the state-of-the-art technique for machine learning tasks. However, their high computational demands make them difficult to deploy on embedded devices with limited hardware resources and strict power budgets. Most embedded systems perform better with 8-bit data processing, prompting extensive research into 8-bit network quantization to enable faster inference. This article proposes a unified 8-bit inference and training framework to support object detection tasks, striving to balance accuracy and speed for conventional convolutional neural networks (CNNs). First, this article establishes a unified full int8 posttraining quantization (PTQ) method using KL divergence to evaluate the range of parameter distributions and thresholds before and after quantization. This method effectively addresses the quantization issues commonly found in networks with linear activations. For networks with nonlinear activations, this article introduces a hybrid precision posttraining quantization (H-PTQ) method that utilizes hybrid precision to perform forward inference, thereby mitigating quantization errors caused by nonlinear activation functions. Furthermore, quantization-aware training (QAT) typically employs the straight-through estimator (STE) for backward propagation of gradients through the quantization function. However, since STE is an approximate computation, this article proposes an alternative called alpha-quantization-aware training (alpha-QAT). This method replaces the quantized weights in the loss function with affine combinations of the quantized and full-precision weights, enabling more precise forward and backward propagation to fine-tune the errors introduced by quantization. Finally, this article evaluates the quantized networks on ARM platforms and reports experiments across multiple datasets.
The results indicate that the proposed PTQ, H-PTQ, and alpha-QAT methods achieved maximum accelerations of 4x, 2.3x, and 3.9x, respectively. In addition, these methods significantly reduced memory overhead by up to 57.11%, 43.16%, and 91.94%, and achieved model compression rates of up to 51.52%, 48.48%, and 49.70%, respectively.
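
The affine combination of quantized and full-precision weights described for alpha-QAT can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the symmetric per-tensor int8 scheme, the function names (`quantize_int8`, `dequantize`, `blend_weights`), and the choice that `alpha` multiplies the quantized weights while `1 - alpha` multiplies the full-precision ones.

```python
from typing import Tuple

import numpy as np


def quantize_int8(w: np.ndarray) -> Tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization (assumed scheme, not taken from the paper)."""
    max_abs = float(np.max(np.abs(w)))
    # Avoid division by zero for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to floating point."""
    return q.astype(np.float32) * scale


def blend_weights(w: np.ndarray, alpha: float) -> np.ndarray:
    """Affine combination alpha * Q(w) + (1 - alpha) * w.

    Hypothetical helper: alpha = 0 returns the full-precision weights,
    alpha = 1 returns the dequantized int8 weights.
    """
    q, scale = quantize_int8(w)
    w_q = dequantize(q, scale)
    return alpha * w_q + (1.0 - alpha) * w


if __name__ == "__main__":
    weights = np.array([-1.0, -0.4, 0.0, 0.3, 0.9], dtype=np.float32)
    for alpha in (0.0, 0.5, 1.0):
        mixed = blend_weights(weights, alpha)
        print(alpha, float(np.max(np.abs(mixed - weights))))
```

Under these assumptions, the deviation of the blended weights from the full-precision weights scales linearly with `alpha`, so a schedule that moves `alpha` from 0 toward 1 would introduce the quantization error gradually. Whether the article uses such a schedule is not stated in the abstract.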