Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation

Zechun Liu, Kwang-Ting Cheng, Dong Huang, Eric Xing, Zhiqiang Shen
DOI: https://doi.org/10.48550/arXiv.2111.14826
2022-04-07
Abstract: The nonuniform quantization strategy for compressing neural networks usually achieves better performance than its counterpart, i.e., uniform strategy, due to its superior representational capacity. However, many nonuniform quantization methods overlook the complicated projection process in implementing the nonuniformly quantized weights/activations, which incurs non-negligible time and space overhead in hardware deployment. In this study, we propose Nonuniform-to-Uniform Quantization (N2UQ), a method that can maintain the strong representation ability of nonuniform methods while being hardware-friendly and efficient as the uniform quantization for model inference. We achieve this through learning the flexible in-equidistant input thresholds to better fit the underlying distribution while quantizing these real-valued inputs into equidistant output levels. To train the quantized network with learnable input thresholds, we introduce a generalized straight-through estimator (G-STE) for intractable backward derivative calculation w.r.t. threshold parameters. Additionally, we consider entropy preserving regularization to further reduce information loss in weight quantization. Even under this adverse constraint of imposing uniformly quantized weights and activations, our N2UQ outperforms state-of-the-art nonuniform quantization methods by 0.5~1.7% on ImageNet, demonstrating the contribution of N2UQ design. Code and models are available at: https://github.com/liuzechun/Nonuniform-to-Uniform-Quantization.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the trade-off between hardware implementation efficiency and model accuracy in quantization methods for neural network compression. Specifically:

1. **Hardware implementation efficiency**: Nonuniform quantization usually achieves better accuracy than uniform quantization, but its complex projection process incurs significant time and space overhead in hardware deployment. For example, the outputs of nonuniform quantization are usually floating-point values that must be mapped to binary codes through look-up tables (LUTs) before multiplication can be accelerated, which increases hardware area and energy consumption.
2. **Model performance**: Uniform quantization is more hardware-friendly, but its equally spaced quantization levels cannot adapt well to different input distributions, which leads to large quantization errors and lower accuracy. The drop relative to the full-precision model is especially pronounced at low bit widths (such as 2-bit).

To improve hardware implementation efficiency and model performance together, the paper proposes the **Nonuniform-to-Uniform Quantization (N2UQ)** method. The main goals of N2UQ are:

- **Maintaining hardware-friendliness**: Because the quantized outputs are uniformly spaced, quantized weights and activations can be used directly in efficient bitwise operations, with no additional post-processing step.
- **Improving quantization accuracy**: Learning the input thresholds lets the quantizer fit the underlying data distribution more closely, which reduces quantization error and improves model performance.

### Main contributions

1. **Proposing N2UQ**: A new quantization method that improves quantization accuracy by learning input thresholds while keeping the hardware-friendliness of uniform quantization.
2. **Introducing the Generalized Straight-Through Estimator (G-STE)**: The quantization function has no usable derivative with respect to the input threshold parameters. G-STE supplies a backward approximation that lets the thresholds be learned by gradient descent and gives a finer-grained approximation of the quantization function.
3. **Proposing weight regularization**: Based on an entropy analysis, a new weight regularization method is proposed that further reduces information loss during quantization.
4. **Experimental verification**: Extensive experiments on the ImageNet dataset show that N2UQ significantly improves accuracy across different architectures and bit-width constraints. In particular, the 2-bit ResNet-50 model reaches a top-1 accuracy of 76.4%, only 0.6% lower than the full-precision model, demonstrating the effectiveness of the N2UQ design.

### Formula representation

- **Quantization output** (for n-bit quantization):
\[ x_q=\begin{cases} 0 & \text{if } x_r < T_1\\ 1 & \text{if } T_1\leq x_r < T_2\\ \vdots & \vdots\\ 2^n - 1 & \text{if } x_r\geq T_{2^n - 1} \end{cases} \]
- **Back-propagation of G-STE**:
\[ \frac{\partial x_q}{\partial x_r}=E\left[\frac{\partial\tilde{x}_q}{\partial x_r}\right]=\frac{\partial}{\partial x_r}E[\tilde{x}_q]=\begin{cases} \frac{\partial}{\partial x_r}\left(\frac{x_r - d_{i - 1}}{a_i} + i - 1\right) & \text{if } d_{i - 1}\leq x_r < d_i\\ 0 & \text{otherwise} \end{cases} \]
- **Weight regularization**:
\[ \max H = -\sum_{i = 1}^{N} p_i\log(p_i) \]
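
The three formulas above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation (their code is in the linked repository), and it rests on assumptions that this summary does not state: `a_i` is taken to be the length of the i-th input segment, `d_i` the cumulative boundary `a_1 + ... + a_i` with `d_0 = 0`, and `p_i` the fraction of values assigned to level `i`. The function names `quantize`, `gste_grad`, and `level_entropy` are hypothetical.

```python
import math
from bisect import bisect_right
from itertools import accumulate


def quantize(x_r, thresholds):
    """Forward pass: map a real value to an integer level 0 .. len(thresholds).

    The thresholds T_1 < T_2 < ... may be unevenly spaced, but the output
    levels are consecutive integers, i.e. uniformly spaced.
    """
    # bisect_right counts the thresholds T_i with T_i <= x_r, which is x_q.
    return bisect_right(thresholds, x_r)


def gste_grad(x_r, intervals):
    """Backward slope under the assumed G-STE form: 1/a_i inside segment i.

    Assumption (not stated in the summary): segment i spans [d_{i-1}, d_i)
    with d_0 = 0 and d_i = a_1 + ... + a_i. Outside all segments the
    gradient is 0.
    """
    bounds = [0.0] + list(accumulate(intervals))  # d_0, d_1, ..., d_K
    if x_r < bounds[0] or x_r >= bounds[-1]:
        return 0.0
    i = bisect_right(bounds, x_r)  # 1-based index of the segment holding x_r
    # d/dx_r of ((x_r - d_{i-1}) / a_i + i - 1) is 1 / a_i.
    return 1.0 / intervals[i - 1]


def level_entropy(levels, num_levels):
    """Entropy H = -sum_i p_i log(p_i) of the quantized-level histogram.

    Assumption: p_i is the fraction of values assigned to level i.
    Empty levels contribute 0 (the limit of p log p as p -> 0).
    """
    counts = [0] * num_levels
    for q in levels:
        counts[q] += 1
    total = len(levels)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)
```

For a 2-bit quantizer with the illustrative thresholds `[0.2, 0.5, 1.1]`, `quantize` maps 0.1, 0.2, 0.7, and 5.0 to levels 0, 1, 2, and 3. The entropy is largest, `log(4)`, when the four levels are used equally often, which is the situation the regularizer pushes towards.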