Comprehensive Survey of Model Compression and Speed up for Vision Transformers

Feiyang Chen, Ziqian Luo, Lisang Zhou, Xueting Pan, Ying Jiang
2024-04-16
Abstract: Vision Transformers (ViT) have marked a paradigm shift in computer vision, outperforming state-of-the-art models across diverse tasks. However, their practical deployment is hampered by high computational and memory demands. This study addresses the challenge by evaluating four primary model compression techniques: quantization, low-rank approximation, knowledge distillation, and pruning. We methodically analyze and compare the efficacy of these techniques and their combinations in optimizing ViTs for resource-constrained environments. Our comprehensive experimental evaluation demonstrates that these methods facilitate a balanced compromise between model accuracy and computational efficiency, paving the way for wider application in edge computing devices.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the high computational and memory demands of Vision Transformers (ViTs) in practical deployment. Although ViTs perform well across computer vision tasks, their large parameter counts consume substantial compute and memory, which limits their use in resource-constrained environments such as edge computing devices. The paper evaluates four model compression techniques (quantization, low-rank approximation, knowledge distillation, and pruning) and systematically compares them, alone and in combination, for optimizing ViTs under these constraints. The aim is a trade-off that improves computational efficiency while preserving model accuracy, so that ViTs can be deployed in more settings.
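To make the four techniques concrete, the following is a minimal NumPy sketch applied to a single weight matrix and a pair of logit vectors. It is not the paper's implementation: the function names (`quantize_int8`, `low_rank`, `magnitude_prune`, `distillation_loss`), the per-tensor symmetric int8 scheme, the truncated-SVD factorization, the unstructured magnitude criterion, and the temperature value are all assumptions chosen for illustration.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization; returns (int8 codes, scale)."""
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float32 weights."""
    return q.astype(np.float32) * scale

def low_rank(w, rank):
    """Truncated SVD: factors a (m x r) and b (r x n) such that a @ b approximates w."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # fold singular values into the left factor
    b = vt[:rank, :]
    return a, b

def magnitude_prune(w, sparsity):
    """Unstructured pruning: zero the `sparsity` fraction of smallest-magnitude entries."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value serves as the pruning threshold
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    pruned = w.copy()
    pruned[np.abs(w) <= threshold] = 0.0
    return pruned

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    log_p_t = log_softmax(teacher_logits / temperature)
    log_p_s = log_softmax(student_logits / temperature)
    kl = (np.exp(log_p_t) * (log_p_t - log_p_s)).sum(axis=-1)
    return float(kl.mean() * temperature ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 64))           # stand-in for one layer's weights

    q, scale = quantize_int8(w)
    a, b = low_rank(w, rank=8)
    pruned = magnitude_prune(w, sparsity=0.5)

    print("quantization max error:", np.max(np.abs(w - dequantize(q, scale))))
    print("low-rank relative error:", np.linalg.norm(w - a @ b) / np.linalg.norm(w))
    print("fraction of zeros after pruning:", np.mean(pruned == 0))
    print("distillation loss:", distillation_loss(rng.standard_normal((4, 10)),
                                                  rng.standard_normal((4, 10))))
```

In this sketch, int8 storage uses one byte per weight instead of four for float32, and the rank-8 factorization of a 64 x 64 matrix stores 2 x 64 x 8 = 1024 values instead of 4096. A random matrix has no low-rank structure, so its approximation error here says nothing about how trained ViT weights would behave.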