Training-Free Acceleration of ViTs with Delayed Spatial Merging

Jung Hwan Heo, Seyedarmin Azizi, Arash Fayyazi, Massoud Pedram
2024-07-01
Abstract: Token merging has emerged as a new paradigm that can accelerate the inference of Vision Transformers (ViTs) without any retraining or fine-tuning. To push the frontier of training-free acceleration in ViTs, we improve token merging by adding the perspectives of 1) activation outliers and 2) hierarchical representations. Through a careful analysis of the attention behavior in ViTs, we characterize a delayed onset of the convergent attention phenomenon, which makes token merging undesirable in the bottom blocks of ViTs. Moreover, we augment token merging with a hierarchical processing scheme to capture multi-scale redundancy between visual tokens. Combining these two insights, we build a unified inference framework called DSM: Delayed Spatial Merging. We extensively evaluate DSM on various ViT model scales (Tiny to Huge) and tasks (ImageNet-1k and transfer learning), achieving up to 1.8$\times$ FLOP reduction and 1.6$\times$ throughput speedup at a negligible loss while being two orders of magnitude faster than existing methods.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper aims to improve training-free acceleration of Vision Transformer (ViT) models by refining token merging. It proposes an inference framework called Delayed Spatial Merging (DSM), which rests on two contributions:

1. **Attention Behavior Analysis**: The study analyzes attention behavior in ViTs and identifies a delayed onset of the convergent attention phenomenon, which makes token merging undesirable in the bottom Transformer blocks.
2. **Hierarchical Processing Scheme**: Token merging is augmented with a hierarchical processing scheme that captures multi-scale redundancy among visual tokens.

Combining these two insights, DSM reduces FLOPs by up to 1.8$\times$ and increases throughput by up to 1.6$\times$ with minimal accuracy loss. DSM requires no retraining or fine-tuning, which simplifies deployment, and the abstract reports that it is two orders of magnitude faster than existing methods.
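
To make the trade-off behind delayed merging concrete, the following is a rough sketch, not the authors' implementation. It assumes a ToMe-style schedule in which a fixed number `r` of tokens is merged away before every block whose index is at least `delay`, and it uses the common per-block cost approximation of $12Nd^2 + 2N^2d$ multiply-accumulates for $N$ tokens of width $d$ with an MLP ratio of 4. The names `token_schedule`, `block_macs`, `flop_reduction`, `delay`, and `r` are hypothetical and chosen only for illustration; the hierarchical (multi-scale) part of DSM is not modeled here.

```python
def token_schedule(num_tokens, num_blocks, delay, r):
    """Tokens processed by each block under a hypothetical delayed schedule.

    Assumption: r tokens are merged away before every block whose
    index is >= delay; earlier blocks see the full token sequence.
    """
    schedule = []
    n = num_tokens
    for block in range(num_blocks):
        if block >= delay:
            n = max(1, n - r)  # merging r token pairs removes r tokens
        schedule.append(n)
    return schedule


def block_macs(n, d):
    """Approximate multiply-accumulates of one ViT block (MLP ratio 4).

    12*n*d^2 covers the QKV/output projections and the MLP;
    2*n^2*d covers the attention scores and the weighted value sum.
    """
    return 12 * n * d * d + 2 * n * n * d


def flop_reduction(num_tokens, num_blocks, dim, delay, r):
    """Ratio of baseline cost to cost under the delayed merging schedule."""
    baseline = num_blocks * block_macs(num_tokens, dim)
    merged = sum(
        block_macs(n, dim)
        for n in token_schedule(num_tokens, num_blocks, delay, r)
    )
    return baseline / merged


if __name__ == "__main__":
    # ViT-B/16-like shape: 197 tokens, 12 blocks, width 768 (illustrative only).
    for delay in (0, 3, 6):
        ratio = flop_reduction(197, 12, 768, delay=delay, r=16)
        print(f"delay={delay}: estimated FLOP reduction {ratio:.2f}x")
```

Under these assumptions, a larger `delay` leaves more blocks operating on the full sequence, so the estimated FLOP reduction shrinks; delaying the merge gives up some compute savings in exchange for leaving the bottom blocks untouched, where the paper argues merging is undesirable.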