Pruning One More Token is Enough: Leveraging Latency-Workload Non-Linearities for Vision Transformers on the Edge

Nick John Eliopoulos, Purvish Jajal, James Davis, Gaowen Liu, George K. Thiruvathukal, Yung-Hsiang Lu
2024-09-12
Abstract: This paper investigates how to efficiently deploy vision transformers on edge devices for small workloads. Recent methods reduce the latency of transformer neural networks by removing or merging tokens, with small accuracy degradation. However, these methods are not designed with edge device deployment in mind: they do not leverage information about the latency-workload trends to improve efficiency. We address this shortcoming in our work. First, we identify factors that affect ViT latency-workload relationships. Second, we determine a token pruning schedule by leveraging non-linear latency-workload relationships. Third, we demonstrate a training-free token pruning method utilizing this schedule. We show other methods may increase latency by 2-30%, while we reduce latency by 9-26%. For similar latency (within 5.2% or 7 ms) across devices we achieve 78.6%-84.5% ImageNet1K accuracy, while the state-of-the-art, Token Merging, achieves 45.8%-85.4%.
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the problem of efficiently deploying Vision Transformers (ViT) on edge devices. Specifically, the research focuses on the following points:

1. **Improving Efficiency**: Existing methods reduce the latency of transformer neural networks by removing or merging tokens, but they do not fully exploit hardware characteristics to improve efficiency further.
2. **Hardware Awareness**: Existing methods do not account for how latency varies with workload size on a given device, leaving room for improvement.
3. **No Training Required**: Many existing methods require extensive training time, which is a barrier on edge devices. The method proposed in this paper does not require retraining of pre-trained models.

### Research Contributions

The main contributions of the paper are as follows:

1. **Identifying Influencing Factors**: Identified and analyzed the factors affecting the ViT latency-workload relationship.
2. **Proposing a New Method**: Used the non-linear latency-workload relationship to determine the ViT token pruning schedule.
3. **Designing a Training-Free Pruning Mechanism**: Developed a new training-free token pruning mechanism and evaluated it across different hardware and workload sizes.

Through these contributions, the paper reports that at similar latency (within 5.2% or 7 ms) across devices its method reaches 78.6%-84.5% ImageNet1K accuracy, whereas Token Merging ranges from 45.8% to 85.4%.
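
To make the idea of a non-linear latency-workload relationship concrete, the following is a minimal sketch and not the paper's algorithm. It assumes latency has already been measured on the target device for a range of token counts, and it looks for the token count at which pruning one more token yields the largest latency drop. The function name `largest_latency_drop` and the numbers in `measured` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional, Tuple


def largest_latency_drop(latency_ms: Dict[int, float]) -> Tuple[Optional[int], float]:
    """Find the token count n whose latency is furthest below that of n + 1.

    latency_ms maps a token count to a measured forward-pass latency in ms.
    Returns (n, drop); n is None if no pair of adjacent counts shows a drop.
    """
    best_n: Optional[int] = None
    best_drop = 0.0
    for n in sorted(latency_ms):
        if n + 1 not in latency_ms:
            continue  # no neighbouring measurement to compare against
        drop = latency_ms[n + 1] - latency_ms[n]
        if drop > best_drop:
            best_n, best_drop = n, drop
    return best_n, best_drop


# Hypothetical measurements (NOT from the paper): latency is nearly flat,
# then steps down sharply between 193 and 192 tokens.
measured = {
    197: 10.0, 196: 9.9, 195: 9.9, 194: 9.8,
    193: 9.8, 192: 7.9, 191: 7.9, 190: 7.8,
}

tokens_to_keep, drop = largest_latency_drop(measured)
print(f"keep {tokens_to_keep} tokens, saving {drop:.1f} ms over keeping one more")
```

With a staircase-shaped curve like the one assumed here, pruning to just below a step captures most of the latency benefit while removing few tokens. A real schedule would also have to weigh the accuracy cost of each pruned token, which this sketch ignores.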