Quantization Variation: A New Perspective on Training Transformers with Low-Bit Precision

Xijie Huang, Zhiqiang Shen, Pingcheng Dong, Kwang-Ting Cheng
2024-10-13
Abstract: Despite the outstanding performance of transformers in both language and vision tasks, the expanding computation and model size have increased the demand for efficient deployment. To address the heavy computation and parameter drawbacks, quantization is frequently studied in the community as a representative model compression technique and has seen extensive use on ConvNets. However, due to the unique properties of transformers, the low-bit quantization applications are still limited and underexplored. In this paper, we identify the difficulty of transformer low-bit quantization-aware training on its unique variation behaviors, which significantly differ from ConvNets. Based on comprehensive quantitative analysis, we observe variation in three hierarchies: various module quantization sensitivities, outliers in static weight and activation distribution, and oscillation in dynamic parameter fluctuations. These variations of transformers bring instability to the quantization-aware training (QAT) and negatively influence the performance. We explore the best practices to alleviate the variation's influence during low-bit transformer QAT and propose a variation-aware quantization scheme for both vision and language transformers. We extensively verify and show our scheme can alleviate the variation and improve the performance of transformers across various models and tasks. Our solution substantially improves the 2-bit Swin-T and binary BERT-base, achieving a 3.35% and 1.4% accuracy improvement over previous state-of-the-art methods on ImageNet-1K and GLUE. Codes and models are available at https://github.com/HuangOwen/Quantization-Variation.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the performance degradation and training instability that arise when Transformer models are trained with low-bit quantization. Specifically, the authors study how to compress Transformers to extremely low precision (such as 2-bit or binary) for efficient deployment, especially on devices with limited computing resources.

#### Main problems:

1. **Performance degradation in low-bit quantization**:
   - Although Transformer models perform well in natural language processing and computer vision tasks, their large parameter counts and high computational cost make efficient deployment difficult.
   - Traditional quantization methods (such as post-training quantization, PTQ) do not work well when applied to Transformers, and usually reach only 8-bit or 6-bit quantization, with significant performance degradation.
2. **Unique instability phenomena during quantization**:
   - Transformers exhibit instabilities during quantization that stem from three sources: the differing quantization sensitivity of modules, outliers in weight and activation distributions, and weight oscillation during training.
   - These instabilities make quantization-aware training (QAT) difficult to converge and lead to lower-than-expected performance.
3. **Improving quantization efficiency**:
   - Existing QAT methods are time-consuming and still fall short of the desired performance under extremely low-bit quantization.
   - More effective quantization methods are therefore needed to reduce training time and improve model performance.

### Main contributions of the paper:

- **Revealing the complexity of Transformer quantization**: By analyzing variation at three levels (module sensitivity, distribution outliers, weight oscillation), the authors expose the challenges inherent in Transformer quantization.
- **Proposing solutions tailored to this variation**: These include a module-dependent quantization scheme, a regularization strategy for oscillation, and multi-crop knowledge distillation.
- **Verifying effectiveness and efficiency**: Extensive experiments demonstrate the performance and efficiency of the proposed methods on multiple Transformer architectures (such as DeiT, Swin, SReT, BERT).

### Conclusion:

Through in-depth analysis of the unique behavior of Transformers under low-bit quantization and targeted optimization methods, this paper significantly improves the performance and training efficiency of low-bit quantized Transformer models.
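
To make the weight oscillation mentioned in this summary more concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a symmetric uniform quantizer with a fixed scale, and it defines an "oscillation" as a reversal in the direction of change of a weight's integer level between training steps, which is one common working definition and may differ from the paper's. The names `quantize_level` and `count_oscillations` and the weight trajectories are hypothetical.

```python
import math


def quantize_level(w, scale, bits=2):
    """Map a real-valued (latent) weight to a signed integer level.

    Assumption: symmetric uniform quantizer with a fixed scale.
    With bits=2 the available levels are -2, -1, 0, 1.
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # Round half up; avoids Python's round-half-to-even behaviour.
    level = math.floor(w / scale + 0.5)
    return max(qmin, min(qmax, level))


def count_oscillations(latent_history, scale, bits=2):
    """Count direction reversals of the integer level across steps.

    Assumption: an oscillation is counted whenever the level changes
    in the opposite direction to its previous change.
    """
    levels = [quantize_level(w, scale, bits) for w in latent_history]
    oscillations = 0
    prev_direction = 0  # 0 means no change has been seen yet
    for before, after in zip(levels, levels[1:]):
        if after == before:
            continue
        direction = 1 if after > before else -1
        if prev_direction != 0 and direction != prev_direction:
            oscillations += 1
        prev_direction = direction
    return oscillations


# Hypothetical latent-weight trajectories over five training steps.
# With scale=0.5 the boundary between levels 0 and 1 sits at 0.25.
near_boundary = [0.24, 0.26, 0.24, 0.27, 0.23]  # hovers around the boundary
stable = [0.10, 0.12, 0.11, 0.13, 0.12]         # stays well inside level 0

print(count_oscillations(near_boundary, scale=0.5))
print(count_oscillations(stable, scale=0.5))
```

In this sketch, tiny updates to a latent weight sitting near a rounding boundary flip its quantized value back and forth, while equally small updates far from a boundary leave it unchanged. At 2-bit precision the levels are few and widely spaced, so each flip is a large change in the effective weight, which suggests why such oscillation can destabilize low-bit QAT.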