Pruning One More Token is Enough: Leveraging Latency-Workload Non-Linearities for Vision Transformers on the Edge

Nick John Eliopoulos, Purvish Jajal, James Davis, Gaowen Liu, George K. Thiruvathukal, Yung-Hsiang Lu

2024-09-12

Abstract: This paper investigates how to efficiently deploy vision transformers (ViTs) on edge devices for small workloads. Recent methods reduce the latency of transformer neural networks by removing or merging tokens, with small accuracy degradation. However, these methods are not designed with edge device deployment in mind: they do not leverage information about latency-workload trends to improve efficiency. We address this shortcoming in our work. First, we identify factors that affect ViT latency-workload relationships. Second, we determine a token pruning schedule by leveraging non-linear latency-workload relationships. Third, we demonstrate a training-free token pruning method that uses this schedule. We show that other methods may increase latency by 2-30%, while we reduce latency by 9-26%. For similar latency (within 5.2% or 7ms) across devices, we achieve 78.6%-84.5% ImageNet1K accuracy, while the state-of-the-art, Token Merging, achieves 45.8%-85.4%.

Machine Learning, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the issue of efficiently deploying Vision Transformers (ViT) on edge devices. Specifically, the research focuses on the following points:

1. **Improving Efficiency**: Existing methods reduce the latency of transformer neural networks by removing or merging tokens, but these methods do not fully utilize hardware characteristics to further enhance efficiency.
2. **Hardware Awareness**: Existing methods do not fully consider hardware characteristics and the latency-workload relationship, leaving room for improvement.
3. **No Training Required**: Many existing methods require extensive training time, which is a barrier on edge devices. The method proposed in this paper does not require retraining of pre-trained models.

### Research Contributions

The main contributions of the paper are as follows:

1. **Identifying Influencing Factors**: Identified and analyzed the factors affecting the ViT latency-workload relationship.
2. **Proposing a New Method**: Utilized the latency-workload relationship to determine the ViT token pruning strategy.
3. **Designing a Training-Free Pruning Mechanism**: Developed a new training-free token pruning mechanism that achieves higher accuracy under different hardware and workload sizes.

Through these contributions, the paper demonstrates that its method achieves higher accuracy under similar latency conditions compared to existing technologies (such as Token Merging).
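To make the idea of using a latency-workload relationship concrete, the following is a minimal sketch and not the paper's actual algorithm. It assumes latency has already been measured on the target device for several token counts, and that latency falls in steps rather than linearly. The function name `pick_token_count`, the `tolerance_ms` parameter, and every number in `measured` are hypothetical, chosen only for illustration.

```python
from typing import Dict


def pick_token_count(latency_ms: Dict[int, float],
                     min_tokens: int,
                     tolerance_ms: float = 0.0) -> int:
    """Pick how many tokens to keep, given measured latency per token count.

    Hypothetical helper: among token counts >= min_tokens, find the lowest
    measured latency, then return the LARGEST token count whose latency is
    within tolerance_ms of it. Keeping more tokens at (nearly) the same
    latency is assumed to be better for accuracy.
    """
    candidates = {n: t for n, t in latency_ms.items() if n >= min_tokens}
    if not candidates:
        raise ValueError("no measured token count satisfies min_tokens")
    best = min(candidates.values())
    return max(n for n, t in candidates.items() if t <= best + tolerance_ms)


# Made-up measurements (token count -> latency in ms) with a step
# between 129 and 128 tokens; real values must be profiled per device.
measured = {197: 12.0, 180: 12.1, 160: 11.9, 129: 11.8,
            128: 9.0, 112: 8.9, 96: 8.8, 64: 8.7}

keep = pick_token_count(measured, min_tokens=96, tolerance_ms=0.25)
print(keep)  # 128
```

With these illustrative numbers, pruning from 197 down to 129 tokens saves almost nothing, while pruning one more token (to 128) crosses the step and saves about 3 ms. Pruning further, down to 96, buys little additional latency at a likely accuracy cost, so the sketch settles on 128.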

Single-shot Pruning and Quantization for Hardware-Friendly Neural Network Acceleration

Token Cropr: Faster ViTs for Quite a Few Tasks

SPViT: Enabling Faster Vision Transformers Via Latency-Aware Soft Token Pruning

An Attention-Based Token Pruning Method for Vision Transformers

Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling

VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation

Learned Thresholds Token Merging and Pruning for Vision Transformers

Token Pruning using a Lightweight Background Aware Vision Transformer

SAViT: Structure-Aware Vision Transformer Pruning Via Collaborative Optimization.

Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers

Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers

LPViT: Low-Power Semi-structured Pruning for Vision Transformers

DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification.

PPT: Token Pruning and Pooling for Efficient Vision Transformers

Attention Map Guided Transformer Pruning for Edge Device

Learned Token Pruning for Transformers

Width & Depth Pruning for Vision Transformers

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers