Abstract: Despite the outstanding performance of transformers in both language and vision tasks, the expanding computation and model size have increased the demand for efficient deployment. To address the heavy computation and parameter drawbacks, quantization is frequently studied in the community as a representative model compression technique and has seen extensive use on ConvNets. However, due to the unique properties of transformers, the low-bit quantization applications are still limited and underexplored. In this paper, we identify the difficulty of transformer low-bit quantization-aware training on its unique variation behaviors, which significantly differ from ConvNets. Based on comprehensive quantitative analysis, we observe variation in three hierarchies: various module quantization sensitivities, outliers in static weight and activation distribution, and oscillation in dynamic parameter fluctuations. These variations of transformers bring instability to the quantization-aware training (QAT) and negatively influence the performance. We explore the best practices to alleviate the variation's influence during low-bit transformer QAT and propose a variation-aware quantization scheme for both vision and language transformers. We extensively verify and show our scheme can alleviate the variation and improve the performance of transformers across various models and tasks. Our solution substantially improves the 2-bit Swin-T and binary BERT-base, achieving a 3.35% and 1.4% accuracy improvement over previous state-of-the-art methods on ImageNet-1K and GLUE. Codes and models are available at https://github.com/HuangOwen/Quantization-Variation.
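
To make the outlier issue named in the abstract concrete, here is a minimal sketch in plain Python of symmetric uniform quantization with a max-abs scale. The function name `quantize_uniform`, the sample values, and the max-abs scaling rule are illustrative assumptions, not the quantizer used in the paper.

```python
def quantize_uniform(values, num_bits):
    """Symmetric uniform fake-quantization with a max-abs scale (toy example).

    Assumes at least one value is non-zero, so the scale is never 0.
    """
    qmax = 2 ** (num_bits - 1) - 1       # 1 for 2-bit, 127 for 8-bit
    qmin = -(2 ** (num_bits - 1))        # -2 for 2-bit, -128 for 8-bit
    scale = max(abs(v) for v in values) / qmax
    out = []
    for v in values:
        q = round(v / scale)             # nearest integer level
        q = max(qmin, min(qmax, q))      # clip to the representable range
        out.append(q * scale)            # map back to real values
    return out


# Hypothetical weights: five small values, then the same values plus one outlier.
inliers = [0.10, -0.20, 0.30, -0.25, 0.05]
without_outlier = quantize_uniform(inliers, num_bits=2)
with_outlier = quantize_uniform(inliers + [3.0], num_bits=2)

print(without_outlier)
print(with_outlier)
```

In this toy setup the five small values occupy three distinct levels at 2 bits, but once the outlier 3.0 sets the scale they all round to zero. That is one way a single outlier can wipe out the resolution available to the rest of a tensor at very low bit widths.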

### What problems does this paper attempt to solve?

This paper addresses the performance degradation and training instability that arise in low-bit quantization of Transformer models. Specifically, the authors focus on how to compress Transformers to extremely low bit widths (such as 2-bit or binary) for efficient deployment, especially on devices with limited computing resources.

#### Main problems:

1. **Performance degradation in low-bit quantization**:
   - Although Transformer models perform well in natural language processing and computer vision tasks, their large parameter counts and high computational cost make efficient deployment difficult.
   - Traditional quantization methods (such as post-training quantization, PTQ) do not work well when applied to Transformers; they typically reach only 8-bit or 6-bit precision, with significant performance degradation.
2. **Unique instability during quantization**:
   - Transformer models exhibit unique instabilities during quantization. These stem from the differing sensitivities of modules to quantization, outliers in the weight and activation distributions, and weight oscillation during training.
   - These instabilities make quantization-aware training (QAT) hard to converge and leave performance below expectations.
3. **Improving quantization efficiency**:
   - Existing QAT methods are time-consuming and still fall short of the desired performance under extremely low-bit quantization.
   - More effective quantization methods are therefore needed to reduce training time and improve model performance.

### Main contributions of the paper:

- **Revealing the complexity of Transformer quantization**: By analyzing variation at three levels (module sensitivity, distribution outliers, weight oscillation), the authors expose the internal challenges of Transformer quantization.
- **Proposing solutions targeted at this variation**: These include module-dependent quantization schemes, regularization strategies against oscillation, and multi-crop knowledge distillation.
- **Verifying effectiveness and efficiency**: Extensive experiments demonstrate the performance and efficiency of the proposed methods on multiple Transformer architectures (such as DeiT, Swin, SReT, and BERT).

### Conclusion:

Through in-depth analysis of the unique behavior of Transformer models under low-bit quantization and the targeted optimization methods it proposes, the paper significantly improves the performance and training efficiency of low-bit quantized Transformers.
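
The weight oscillation mentioned in the summary can be reproduced with a single latent weight. The sketch below is a toy illustration only: the function names, the one-weight squared-error loss, the straight-through estimator (STE) update, and the `damp` penalty that pulls the latent weight toward its own quantized value are all assumptions chosen for clarity. The `damp` term is a generic dampening penalty, not the regularizer proposed in the paper.

```python
import math


def quantize(w, scale=1.0, qmin=-2, qmax=1):
    """Round-to-nearest onto a signed 2-bit grid (ties round up)."""
    q = math.floor(w / scale + 0.5)
    return max(qmin, min(qmax, q)) * scale


def train_latent_weight(target, w0, lr=0.1, steps=200, damp=0.0):
    """Toy QAT loop for one weight.

    Loss: 0.5 * (quantize(w) - target)**2 + 0.5 * damp * (w - quantize(w))**2
    The STE treats d quantize(w) / dw as 1 in the first term; the damping
    term treats quantize(w) as a constant.
    Returns the final latent weight and how often the quantized value flipped.
    """
    w = w0
    q = quantize(w)
    flips = 0
    for _ in range(steps):
        grad = (q - target) + damp * (w - q)
        w -= lr * grad
        new_q = quantize(w)
        if new_q != q:
            flips += 1               # quantized value jumped to another level
        q = new_q
    return w, flips


# The target 0.4 lies between grid levels 0 and 1, so the optimum is unreachable.
w_plain, flips_plain = train_latent_weight(target=0.4, w0=0.3)
w_damped, flips_damped = train_latent_weight(target=0.4, w0=0.3, damp=1.0)

print("no damping:   w =", w_plain, "flips =", flips_plain)
print("with damping: w =", w_damped, "flips =", flips_damped)
```

Without damping, the latent weight hovers around the rounding threshold 0.5 and its quantized value keeps flipping between 0 and 1. With the assumed penalty, the weight settles inside the bin of level 0 and the flipping stops, at the cost of committing to a level that does not match the target.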

Quantization Variation: A New Perspective on Training Transformers with Low-Bit Precision

Hessian-based Mixed-Precision Quantization with Transition Aware Training for Neural Networks

NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers

PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization

PackQViT: Faster Sub-8-bit Vision Transformers Via Full and Packed Quantization on the Mobile

Towards Accurate Post-Training Quantization for Vision Transformer

Quantformer: Learning Extremely Low-precision Vision Transformers

Bit-shrinking: Limiting Instantaneous Sharpness for Improving Post-training Quantization

Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

FrameQuant: Flexible Low-Bit Quantization for Transformers

Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding

Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer

Oscillation-free Quantization for Low-bit Vision Transformers

Mixed Precision Quantization of Transformer Language Models for Speech Recognition

Exploring Quantization for Efficient Pre-Training of Transformer Language Models

Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

RepQuant: Towards Accurate Post-Training Quantization of Large Transformer Models via Scale Reparameterization

MPTQ-ViT: Mixed-Precision Post-Training Quantization for Vision Transformer

Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers

Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective

Q-ViT: Fully Differentiable Quantization for Vision Transformer