MimiQ: Low-Bit Data-Free Quantization of Vision Transformers with Encouraging Inter-Head Attention Similarity

Kanghyun Choi, Hye Yoon Lee, Dain Kwon, SunJong Park, Kyuyeun Kim, Noseong Park, Jinho Lee

2024-08-02

Abstract: Data-free quantization (DFQ) is a technique that creates a lightweight network from its full-precision counterpart without the original training data, often through a synthetic dataset. Although several DFQ methods have been proposed for vision transformer (ViT) architectures, they fail to achieve efficacy in low-bit settings. Examining the existing methods, we identify that their synthetic data produce misaligned attention maps, while those of the real samples are highly aligned. From the observation of aligned attention, we find that aligning attention maps of synthetic data helps to improve the overall performance of quantized ViTs. Motivated by this finding, we devise MimiQ, a novel DFQ method designed for ViTs that focuses on inter-head attention similarity. First, we generate synthetic data by aligning head-wise attention responses in relation to spatial query patches. Then, we apply head-wise structural attention distillation to align the attention maps of the quantized network to those of the full-precision teacher. The experimental results show that the proposed method significantly outperforms baselines, setting a new state-of-the-art performance for data-free ViT quantization.
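
To make "low-bit" concrete, the following is a minimal sketch of uniform min-max quantization followed by dequantization (often called fake quantization). It is not the quantizer used in the paper, which is not specified here; the function name `fake_quantize` and the min-max range choice are assumptions made for illustration only.

```python
import numpy as np

def fake_quantize(x, num_bits):
    """Quantize x to 2**num_bits uniform levels over [min, max], then dequantize.

    Illustrative assumption: a plain asymmetric min-max uniform quantizer,
    not the scheme used by MimiQ.
    """
    x = np.asarray(x, dtype=np.float64)
    levels = 2 ** num_bits - 1            # number of steps between min and max
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((x - lo) / scale)        # integer code in [0, levels]
    return q * scale + lo                 # map the code back to a real value

# Compare the worst-case rounding error at 4 bits and 8 bits.
w = np.linspace(-1.0, 1.0, 1001)
err4 = np.abs(fake_quantize(w, 4) - w).max()
err8 = np.abs(fake_quantize(w, 8) - w).max()
```

With fewer bits the grid is coarser, so the rounding error grows; this is the regime in which the abstract reports that existing DFQ methods lose accuracy.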

Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the accuracy degradation of Vision Transformers (ViTs) under low-bit data-free quantization (DFQ). Existing DFQ methods suffer severe accuracy loss when applied to ViTs in low-bit settings. The authors observe that the synthetic data generated by these methods produce misaligned attention maps, whereas the attention maps of real samples are highly aligned. Based on this observation, they propose MimiQ, a DFQ method that focuses on inter-head attention similarity to improve ViT accuracy under low-bit quantization.

#### Main problem summary:

1. **Poor performance of existing DFQ methods in low-bit quantization**: Existing DFQ methods cause significant accuracy loss when applied to ViTs in low-bit settings.
2. **Quality problems in synthetic data generation**: The synthetic data generated by existing methods produce misaligned attention maps, which hurts the accuracy of the quantized model.
3. **Lack of effective ViT-specific DFQ methods**: Most existing DFQ methods are designed for CNNs and perform poorly when applied directly to ViTs.

### Main contributions of MimiQ:

- **Identifying the importance of attention map alignment**: The authors find that aligning the attention maps of synthetic data significantly improves data-free quantization, especially in low-bit settings.
- **A synthetic data generation method based on attention map similarity**: Higher-quality synthetic data are generated by minimizing the distance between the attention maps of different heads.
- **A fine-grained attention distillation method**: During fine-tuning of the quantized network, head-wise structural attention distillation aligns its attention maps with those of the full-precision teacher.
- **Extensive experimental verification**: Experiments across multiple tasks, ViT architectures, and quantization settings show that MimiQ outperforms the baselines in low-bit quantization.

With these components, MimiQ significantly outperforms baseline methods in low-bit quantization and, in some cases, even exceeds quantization fine-tuning with real data.
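
The notion of inter-head attention alignment described above can be sketched as follows. This is only an illustration under stated assumptions: the distance metric used by MimiQ is not given in this summary, so mean pairwise cosine similarity is swapped in here, and the names `inter_head_similarity` and `alignment_loss` are hypothetical.

```python
import numpy as np

def inter_head_similarity(attn):
    """Mean pairwise cosine similarity between attention heads.

    attn: array of shape (heads, keys), the attention each head assigns to
    the key patches for one query patch. Cosine similarity is an assumed
    stand-in for the paper's actual metric.
    """
    attn = np.asarray(attn, dtype=np.float64)
    heads = attn.shape[0]
    if heads < 2:
        raise ValueError("need at least two heads to compare")
    norms = np.linalg.norm(attn, axis=1, keepdims=True)
    unit = attn / np.clip(norms, 1e-12, None)   # row-normalize each head
    sim = unit @ unit.T                         # (heads, heads) cosine matrix
    off_diag = sim[~np.eye(heads, dtype=bool)]  # drop self-similarity entries
    return float(off_diag.mean())

def alignment_loss(attn):
    """Lower when heads agree; a synthetic-data generator could minimize this."""
    return 1.0 - inter_head_similarity(attn)

# Four heads attending identically (aligned) vs. to disjoint patches (misaligned).
aligned = np.tile(np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1]), (4, 1))
misaligned = np.eye(4, 6)

loss_aligned = alignment_loss(aligned)
loss_misaligned = alignment_loss(misaligned)
```

In this toy case the aligned heads give a loss near zero and the disjoint heads a loss near one, which mirrors the contrast the authors report between real samples and the synthetic data of earlier methods.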
