Abstract: The adoption of Vision Transformers (ViTs) in resource-constrained applications necessitates improvements in inference throughput. To this end, several token pruning and merging approaches have been proposed that improve efficiency by successively reducing the number of tokens. However, it remains an open problem to design a token reduction method that is fast, maintains high performance, and is applicable to various vision tasks. In this work, we present a token pruner that uses auxiliary prediction heads that learn to select tokens end-to-end based on task relevance. These auxiliary heads can be removed after training, leading to throughput close to that of a random pruner. We evaluate our method on image classification, semantic segmentation, object detection, and instance segmentation, and show speedups of 1.5 to 4x with small drops in performance. As a best case, on the ADE20k semantic segmentation benchmark, we observe a 2x speedup relative to the no-pruning baseline, with a negligible performance penalty of 0.1 median mIoU across 5 seeds.

What problem does this paper attempt to address?

This paper aims to improve the inference throughput of Vision Transformers (ViTs) in resource-constrained applications. The authors propose Cross-attention pruning (Cropr), which reduces the number of tokens while maintaining high performance by learning, end-to-end, to select task-relevant tokens. The method applies to a range of vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation.

### Core Problems of the Paper

1. **Improving Inference Efficiency**: How to speed up inference by reducing the number of tokens in ViTs without significantly sacrificing performance.
2. **Multi-task Applicability**: How to design a token-reduction method that applies to multiple vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation.
3. **Task-Relevance Evaluation**: How to accurately and efficiently evaluate the importance of each token for a specific task.

### Solutions

- **Cropr Module**: Auxiliary prediction heads learn to select task-relevant tokens. These heads can be removed after training, so they add almost no overhead during inference.
- **Layer-by-Layer Pruning**: Unimportant tokens are pruned gradually in the intermediate layers of the ViT, and only the most relevant tokens are passed to deeper layers.
- **Last Layer Fusion (LLF)**: In dense tasks (such as semantic segmentation), the pruned tokens are reintroduced to recover the pruned information and preserve the accuracy of pixel-level predictions.

### Experimental Results

- **Image Classification**: Experiments with the EVA-02-L model on the ImageNet-1k dataset showed a 1.6x to 1.9x speedup while maintaining high accuracy.
- **Semantic Segmentation**: Experiments with the Segmenter model on the ADE20k dataset showed a 2x speedup relative to the no-pruning baseline, with a drop of only 0.1 median mIoU across 5 seeds.

### Main Contributions

- **Efficient Token Selection**: An end-to-end learned selection of task-relevant tokens reduces unnecessary computation.
- **Wide Applicability**: Cropr is applied not only to image classification but also to semantic segmentation, object detection, and instance segmentation.
- **Balance between Performance and Efficiency**: Inference speed improves significantly while performance stays relatively high, which suits resource-constrained scenarios.

In conclusion, the paper addresses the problem of improving ViT inference throughput in resource-constrained applications by proposing Cropr, which maintains high performance and applies to multiple vision tasks.
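
The pruning and Last Layer Fusion steps described under Solutions can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the scoring function (a softmax over scaled dot products between a single query vector and each token), the function names `cross_attention_scores`, `prune_tokens`, and `last_layer_fusion`, and the toy inputs are all assumptions made for illustration. The summary above only states that auxiliary heads learn to select task-relevant tokens and that pruned tokens are reintroduced for dense tasks.

```python
import math
from typing import List, Tuple

Vector = List[float]
Token = Tuple[int, Vector]  # (original position, feature vector)


def cross_attention_scores(query: Vector, vectors: List[Vector]) -> List[float]:
    """Softmax over scaled dot products between one query and every token.

    Assumption: a single query vector stands in for whatever the auxiliary
    head has learned; the real scoring may differ.
    """
    scale = math.sqrt(len(query))
    logits = [sum(q * v for q, v in zip(query, vec)) / scale for vec in vectors]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prune_tokens(tokens: List[Token], scores: List[float],
                 keep: int) -> Tuple[List[Token], List[Token]]:
    """Split tokens into the `keep` highest-scoring ones and the rest."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep_set = set(ranked[:keep])
    kept = [t for i, t in enumerate(tokens) if i in keep_set]
    pruned = [t for i, t in enumerate(tokens) if i not in keep_set]
    return kept, pruned


def last_layer_fusion(kept: List[Token], pruned: List[Token]) -> List[Vector]:
    """Put pruned tokens back at their original positions (hypothetical LLF)."""
    return [vec for _, vec in sorted(kept + pruned, key=lambda t: t[0])]


# Toy example: four 2-dimensional tokens and a query aligned with the first axis.
tokens = [(0, [1.0, 0.0]), (1, [0.0, 1.0]), (2, [2.0, 0.0]), (3, [0.0, -1.0])]
query = [1.0, 0.0]

scores = cross_attention_scores(query, [vec for _, vec in tokens])
kept, pruned = prune_tokens(tokens, scores, keep=2)
print([pos for pos, _ in kept])    # [0, 2]
print([pos for pos, _ in pruned])  # [1, 3]

# In a real model the kept tokens would pass through further layers here.
restored = last_layer_fusion(kept, pruned)
print(len(restored))               # 4
```

Tracking each token's original position is what makes the fusion step possible: a dense task such as semantic segmentation needs one output per input location, so the pruned tokens must be returned to the right places before the final prediction.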

Token Cropr: Faster ViTs for Quite a Few Tasks

VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation

Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

Token Pruning using a Lightweight Background Aware Vision Transformer

Which Tokens to Use? Investigating Token Reduction in Vision Transformers

Pruning One More Token is Enough: Leveraging Latency-Workload Non-Linearities for Vision Transformers on the Edge

No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling

Efficient Vision Transformer via Token Merger

An Attention-Based Token Pruning Method for Vision Transformers

Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer

Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation

Making Vision Transformers Efficient from A Token Sparsification View

Exploring Token Pruning in Vision State Space Models

PPT: Token Pruning and Pooling for Efficient Vision Transformers

Multi-Scale And Token Mergence: Make Your ViT More Efficient

SPViT: Enabling Faster Vision Transformers Via Latency-Aware Soft Token Pruning

DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

SAViT: Structure-Aware Vision Transformer Pruning Via Collaborative Optimization

GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation

Dynamic Token-Pass Transformers for Semantic Segmentation

Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers