PQV-Mobile: A Combined Pruning and Quantization Toolkit to Optimize Vision Transformers for Mobile Applications

Kshitij Bhardwaj
2024-08-16
Abstract: While Vision Transformers (ViTs) are extremely effective at computer vision tasks and are replacing convolutional neural networks as the new state-of-the-art, they are complex and memory-intensive models. In order to effectively run these models on resource-constrained mobile/edge systems, there is a need to not only compress these models but also to optimize them and convert them into deployment-friendly formats. To this end, this paper presents a combined pruning and quantization tool, called PQV-Mobile, to optimize vision transformers for mobile applications. The tool is able to support different types of structured pruning based on magnitude importance, Taylor importance, and Hessian importance. It also supports quantization from FP32 to FP16 and int8, targeting different mobile hardware backends. We demonstrate the capabilities of our tool and show important latency-memory-accuracy trade-offs for different amounts of pruning and int8 quantization with Facebook Data Efficient Image Transformer (DeiT) models. Our results show that even pruning a DeiT model by 9.375% and quantizing it to int8 from FP32, followed by optimizing for mobile applications, yields a latency reduction of 7.18X with a small accuracy loss of 2.24%. The tool is open source.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper addresses is how to optimize Vision Transformers (ViTs) for resource-constrained mobile/edge systems. Although ViTs perform excellently in computer vision tasks and are replacing Convolutional Neural Networks (CNNs) as the new state of the art, they are complex and memory-intensive, which makes them difficult to deploy directly on mobile devices. These models therefore need to be compressed, optimized, and converted into a lightweight format suitable for mobile applications. To this end, the author proposes PQV-Mobile, a tool that combines pruning and quantization. It supports several structured pruning methods, based on magnitude importance, Taylor importance, and Hessian importance. It also supports quantization from FP32 to FP16 and int8, targeting different mobile hardware backends.

### Specific Problems and Solutions

1. **High model complexity and memory usage**:
   - **Solution**: Reduce the number of model parameters through pruning to lower the computational complexity and memory usage.
   - **Formula Explanation**: Pruning removes unimportant connections or parameters. For example, pruning based on magnitude importance can select the parameters to prune with the following rule:
     \[ w_i = \begin{cases} 0 & \text{if } |w_i| < \tau \\ w_i & \text{otherwise} \end{cases} \]
     where \( w_i \) is a model parameter and \( \tau \) is the threshold. This is the elementwise form; structured pruning applies the same idea to the importance score of a whole group of weights, such as a channel or an attention head.
2. **Quantization precision loss**:
   - **Solution**: Convert floating-point numbers into low-precision integers through quantization, thereby reducing memory usage and computational cost. The quantized model also needs to be fine-tuned to maintain high accuracy.
   - **Formula Explanation**: The quantization process can be expressed as:
     \[ q_i = \text{round}\left(\frac{x_i - s}{\Delta}\right) \]
     where \( x_i \) is the original floating-point number, \( q_i \) is the quantized integer, \( s \) is the offset (the floating-point value that maps to the integer zero), and \( \Delta \) is the quantization step size (the scale).
3. **Support for different hardware backends**:
   - **Solution**: PQV-Mobile supports multiple hardware backends (such as x86, FBGEMM, QNNPACK, ONEDNN), so that the model can run efficiently on different types of target devices.

### Experimental Results

The experimental results show that pruning the DeiT model by 9.375% with PQV-Mobile, quantizing it from FP32 to int8, and then optimizing it for mobile reduces inference latency by 7.18X, while accuracy drops by only 2.24%. This indicates that PQV-Mobile can substantially improve inference speed at a small cost in accuracy, which addresses the deployment problem of ViTs on mobile devices.

### Summary

PQV-Mobile optimizes Vision Transformers by combining pruning and quantization, enabling them to better fit resource-constrained mobile/edge systems. The tool improves inference speed and reduces memory usage and computational cost, supporting wider use of ViTs in mobile applications.
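
### Illustrative Sketch of the Two Formulas

The magnitude-pruning rule and the quantization formula in the section above can be sketched in a few lines of plain Python. This is only a minimal illustration of the two equations, not PQV-Mobile's implementation: the function names `magnitude_prune`, `quantize`, and `dequantize` are hypothetical, the value \( s \) is treated as an offset because the formula subtracts it before dividing by \( \Delta \), and the clamp to the signed int8 range [-128, 127] is an assumption added here (the formula itself does not clamp).

```python
def magnitude_prune(weights, tau):
    """Elementwise rule: set w_i to 0 if |w_i| < tau, otherwise keep it."""
    return [0.0 if abs(w) < tau else w for w in weights]


def quantize(values, offset, step, qmin=-128, qmax=127):
    """q_i = round((x_i - offset) / step), clamped to [qmin, qmax].

    The int8 clamp range is an assumption of this sketch.
    Python's round() resolves exact .5 ties to the nearest even integer.
    """
    quantized = []
    for x in values:
        q = round((x - offset) / step)
        quantized.append(max(qmin, min(qmax, q)))
    return quantized


def dequantize(q_values, offset, step):
    """Approximate inverse of quantize: x_i is roughly q_i * step + offset."""
    return [q * step + offset for q in q_values]


if __name__ == "__main__":
    weights = [0.5, -0.05, 0.2, -0.8, 0.01]
    pruned = magnitude_prune(weights, tau=0.1)
    q = quantize(pruned, offset=0.0, step=0.01)
    print("pruned:     ", pruned)
    print("quantized:  ", q)
    print("dequantized:", dequantize(q, offset=0.0, step=0.01))
```

Weights below the threshold become exactly zero, so they survive quantization as the integer zero when the offset is 0. Values that fall outside the representable range saturate at the clamp bounds, which is one source of the accuracy loss that fine-tuning is meant to recover.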