Abstract: We present CLAMP-ViT, a data-free post-training quantization method for vision transformers (ViTs). We identify the limitations of recent techniques, notably their inability to leverage meaningful inter-patch relationships, leading to the generation of simplistic and semantically vague data, impacting quantization accuracy. CLAMP-ViT employs a two-stage approach, cyclically adapting between data generation and model quantization. Specifically, we incorporate a patch-level contrastive learning scheme to generate richer, semantically meaningful data. Furthermore, we leverage contrastive learning in layer-wise evolutionary search for fixed- and mixed-precision quantization to identify optimal quantization parameters while mitigating the effects of a non-smooth loss landscape. Extensive evaluations across various vision tasks demonstrate the superiority of CLAMP-ViT, with performance improvements of up to 3% in top-1 accuracy for classification, 0.6 mAP for object detection, and 1.5 mIoU for segmentation at similar or better compression ratio over existing alternatives. Code is available at https://github.com/georgia-tech-synergy-lab/CLAMP-ViT.git
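
As a rough illustration of what fixed- versus mixed-precision quantization trades off, the sketch below applies a symmetric uniform quantizer to random weights at several bit-widths. The quantizer and the helper names (`quantize_uniform`, `mse`) are assumptions made for illustration; the abstract does not say which quantizer CLAMP-ViT uses.

```python
import numpy as np


def quantize_uniform(w, num_bits):
    """Symmetric uniform fake-quantization (an illustrative choice, not the paper's)."""
    qmax = 2 ** (num_bits - 1) - 1           # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = np.max(np.abs(w))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax)  # integer grid
    return q * scale                          # map back to floating point


def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((a - b) ** 2))


rng = np.random.default_rng(0)
w = rng.standard_normal(1024)                 # stand-in for one layer's weights

# Quantization error grows as the bit-width shrinks.
errors = {bits: mse(w, quantize_uniform(w, bits)) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit MSE: {err:.6f}")
```

Fixed precision would use one bit-width for every layer. Mixed precision assigns a bit-width per layer, spending more bits where the error hurts most.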

What problem does this paper attempt to address?

The paper addresses how to generate high-quality synthetic data, without access to real data, during post-training quantization of Vision Transformers (ViTs), so that the quantized model performs better. Specifically, existing data-free quantization (DFQ) methods have the following problems when generating synthetic data:

1. **Failure to capture meaningful inter-patch relationships**: Existing methods such as PSAQ-ViT v1 and v2 optimize Gaussian noise by maximizing global entropy. This approach assumes that all patches are equally important and ignores spatial sensitivity, so it may fail to capture semantically meaningful inter-patch relationships.
2. **Non-smooth loss landscape**: Synthetic data generated by existing methods tends to produce a non-smooth loss landscape during quantization, which leads to sub-optimal parameter search and hurts the generalization of the quantized model.
3. **Fixed-precision quantization only**: Existing DFQ methods mainly support fixed-precision quantization and cannot flexibly perform mixed-precision quantization, which limits their use across tasks.

To overcome these problems, the paper proposes CLAMP-ViT (Contrastive Data-Free Learning for Adaptive Post-Training Quantization of ViTs), a data-free quantization method that generates semantically rich and spatially sensitive synthetic data and optimizes quantization parameters through contrastive learning and evolutionary search, thereby improving the performance of the quantized model.

### Main contributions:

1. **Generate semantically rich synthetic data**: CLAMP-ViT uses the architectural characteristics of ViTs and the intrinsic properties of real images to generate semantically rich and spatially sensitive synthetic data through contrastive learning.
2. **Smooth the loss landscape**: A local contrastive loss captures the distribution differences of intermediate-layer outputs, which smooths the loss landscape and improves both the convergence of the parameter search and the generalization of the model.
3. **Support fixed- and mixed-precision quantization**: CLAMP-ViT supports both fixed-precision and mixed-precision quantization and applies to a variety of vision tasks.

### Method overview:

1. **Stage 1: Synthetic data generation**:
   - Generate semantically rich synthetic data using contrastive learning.
   - For each "anchor patch" in the output of each attention head, select positive and negative samples, then maximize the similarity between the anchor patch and the positive samples and minimize its similarity to the negative samples.
2. **Stage 2: Quantization**:
   - Use an evolutionary search strategy, combined with the local contrastive loss, to identify the optimal quantization parameters.
   - Alternate between the two stages over multiple rounds, updating the generated data and the quantized model in turn so that the data stays suited to the quantization process.

### Experimental results:

- **Image classification**: On the ImageNet-1K test set, CLAMP-ViT outperforms existing DFQ methods in both fixed-precision and mixed-precision quantization (the abstract reports gains of up to 3% in top-1 accuracy).
- **Object detection**: On the COCO 2017 dataset, CLAMP-ViT achieves performance close to the baseline model in both fixed-precision and mixed-precision quantization.
- **Semantic segmentation**: On the ADE20K dataset, CLAMP-ViT performs better than PSAQ-ViT v2 in fixed-precision quantization.

In summary, CLAMP-ViT improves the quantization performance of ViTs across vision tasks by generating high-quality synthetic data and optimizing quantization parameters.

CLAMP-ViT: Contrastive Data-Free Learning for Adaptive Post-Training Quantization of ViTs

PackQViT: Faster Sub-8-bit Vision Transformers Via Full and Packed Quantization on the Mobile.

Towards Accurate Post-Training Quantization for Vision Transformer

Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer

ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers

PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers

Q-ViT: Fully Differentiable Quantization for Vision Transformer

PSAQ-ViT V2: Toward Accurate and General Data-Free Quantization for Vision Transformers

AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer

MPTQ-ViT: Mixed-Precision Post-Training Quantization for Vision Transformer

Bi-ViT: Pushing the Limit of Vision Transformer Quantization

Mixed Non-linear Quantization for Vision Transformers

Q-HyViT: Post-Training Quantization of Hybrid Vision Transformers with Bridge Block Reconstruction for IoT Systems

PTQ4ViT: Post-training quantization for vision transformers with twin uniform quantization

ViT-1.58b: Mobile Vision Transformers in the 1-bit Era

TSPTQ-ViT: Two-scaled post-training quantization for vision transformer

P2-ViT: Power-of-Two Post-Training Quantization and Acceleration for Fully Quantized Vision Transformer

Effective Vision Transformer Training: A Data-Centric Perspective

Trio-ViT: Post-Training Quantization and Acceleration for Softmax-Free Efficient Vision Transformer

MimiQ: Low-Bit Data-Free Quantization of Vision Transformers with Encouraging Inter-Head Attention Similarity

I&S-ViT: An Inclusive & Stable Method for Pushing the Limit of Post-Training ViTs Quantization