Abstract: While Vision Transformers (ViTs) are extremely effective at computer vision tasks and are replacing convolutional neural networks as the new state-of-the-art, they are complex and memory-intensive models. To run these models effectively on resource-constrained mobile/edge systems, they must not only be compressed but also optimized and converted into deployment-friendly formats. To this end, this paper presents a combined pruning and quantization tool, called PQV-Mobile, to optimize vision transformers for mobile applications. The tool supports different types of structured pruning based on magnitude importance, Taylor importance, and Hessian importance. It also supports quantization from FP32 to FP16 and int8, targeting different mobile hardware backends. We demonstrate the capabilities of our tool and show important latency-memory-accuracy trade-offs for different amounts of pruning and int8 quantization with Facebook Data Efficient Image Transformer (DeiT) models. Our results show that pruning a DeiT model by 9.375% and quantizing it from FP32 to int8, followed by optimization for mobile applications, reduces latency by 7.18X with a small accuracy loss of 2.24%. The tool is open source.

What problem does this paper attempt to address?

The main problem this paper addresses is how to optimize Vision Transformers (ViTs) for resource-constrained mobile/edge systems. Although ViTs perform very well on computer vision tasks and are replacing Convolutional Neural Networks (CNNs) as the mainstream model, they are complex and memory-intensive, which makes them difficult to deploy directly on mobile devices. These models therefore need to be compressed, optimized, and converted into a lightweight format suitable for mobile applications. To achieve this, the authors propose PQV-Mobile, a tool that combines pruning and quantization. It supports several structured pruning methods, based on magnitude importance, Taylor importance, and Hessian importance. It also supports quantization from FP32 to FP16 and int8, and targets different mobile hardware backends.

### Specific Problems and Solutions

1. **High model complexity and memory usage**
   - **Solution**: Reduce the number of model parameters through pruning to lower computational complexity and memory usage.
   - **Formula Explanation**: Pruning removes unimportant connections or parameters. For example, pruning based on magnitude importance can select the parameters to prune as follows:
     \[ w_i = \begin{cases} 0 & \text{if } |w_i| < \tau \\ w_i & \text{otherwise} \end{cases} \]
     where \( w_i \) is a model parameter and \( \tau \) is the threshold. This element-wise form illustrates the idea; structured pruning, which the tool supports, applies an importance criterion to whole groups of parameters rather than to individual weights.
2. **Quantization precision loss**
   - **Solution**: Convert floating-point numbers into low-precision integers through quantization, thereby reducing memory usage and computational cost. The quantized model then needs to be fine-tuned to maintain high accuracy.
   - **Formula Explanation**: The quantization process can be expressed as:
     \[ q_i = \text{round}\left(\frac{x_i - s}{\Delta}\right) \]
     where \( x_i \) is the original floating-point number, \( q_i \) is the quantized integer, \( s \) is the offset subtracted before scaling, and \( \Delta \) is the quantization step size.
3. **Support for different hardware backends**
   - **Solution**: PQV-Mobile supports multiple hardware backends (such as x86, FBGEMM, QNNPACK, ONEDNN), so that the model can run efficiently on different types of mobile devices.

### Experimental Results

The experimental results show that after using PQV-Mobile to prune the DeiT model by 9.375% and quantize it to int8, inference latency is reduced by 7.18 times, while accuracy drops by only 2.24%. This indicates that PQV-Mobile can substantially improve inference speed at a small cost in accuracy, which addresses the problem of deploying ViTs on mobile devices.

### Summary

PQV-Mobile optimizes Vision Transformers by combining pruning and quantization, enabling them to better fit resource-constrained mobile/edge systems. The tool improves inference speed and reduces memory usage and computational cost, supporting wider use of ViTs in mobile applications.
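
The two formulas in the explanation above (magnitude thresholding with \( \tau \), and rounding \( (x_i - s)/\Delta \) to an integer) can be sketched in a few lines of plain Python. This is only an illustrative sketch of those formulas, not PQV-Mobile's implementation: the function names, the `dequantize` helper, the sample values, and the clamp to the int8 range \([-128, 127]\) are assumptions introduced here.

```python
def magnitude_prune(weights, tau):
    """Zero every weight whose absolute value is below the threshold tau."""
    return [0.0 if abs(w) < tau else w for w in weights]


def quantize(values, s, delta, qmin=-128, qmax=127):
    """Compute q = round((x - s) / delta) for each value.

    The clamp to [qmin, qmax] (the signed int8 range by default) is an
    assumption added for illustration; it is not part of the formula above.
    Python's round() rounds exact halves to the nearest even integer.
    """
    return [max(qmin, min(qmax, round((x - s) / delta))) for x in values]


def dequantize(codes, s, delta):
    """Approximate inverse of quantize: x is roughly q * delta + s."""
    return [q * delta + s for q in codes]


# Hypothetical sample weights, chosen to be exactly representable in binary.
weights = [0.5, -0.02, 0.25, 0.01, -0.75]

pruned = magnitude_prune(weights, tau=0.05)     # small weights become 0.0
codes = quantize(pruned, s=0.0, delta=0.25)     # integer codes
restored = dequantize(codes, s=0.0, delta=0.25)

print(pruned)
print(codes)
print(restored)
```

With these sample values every surviving weight is an exact multiple of the step size, so dequantizing recovers the pruned weights exactly; in general the round trip introduces an error of up to half a step per value, which is the precision loss that fine-tuning is meant to compensate for.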

PQV-Mobile: A Combined Pruning and Quantization Toolkit to Optimize Vision Transformers for Mobile Applications
