Token Cropr: Faster ViTs for Quite a Few Tasks

Benjamin Bergner, Christoph Lippert, Aravindh Mahendran
2024-12-02
Abstract: The adoption of Vision Transformers (ViTs) in resource-constrained applications necessitates improvements in inference throughput. To this end, several token pruning and merging approaches have been proposed that improve efficiency by successively reducing the number of tokens. However, it remains an open problem to design a token reduction method that is fast, maintains high performance, and is applicable to various vision tasks. In this work, we present a token pruner that uses auxiliary prediction heads that learn to select tokens end-to-end based on task relevance. These auxiliary heads can be removed after training, leading to throughput close to that of a random pruner. We evaluate our method on image classification, semantic segmentation, object detection, and instance segmentation, and show speedups of 1.5 to 4x with small drops in performance. As a best case, on the ADE20k semantic segmentation benchmark, we observe a 2x speedup relative to the no-pruning baseline, with a negligible performance penalty of 0.1 median mIoU across 5 seeds.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper aims to improve the inference throughput of Vision Transformers (ViTs) in resource-constrained applications. Specifically, the authors propose Cross-attention pruning (Cropr), which reduces the number of tokens while maintaining high performance by selecting task-relevant tokens end-to-end, and thereby speeds up inference. The method is designed to apply across a variety of vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation.

### Core Problems of the Paper

1. **Improving Inference Efficiency**: How to improve inference speed by reducing the number of tokens in ViTs without significantly sacrificing performance.
2. **Multi-task Applicability**: How to design a token-reduction method that can be applied to multiple vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation.
3. **Task-Relevance Evaluation**: How to accurately and efficiently evaluate the importance of each token for a specific task.

### Solutions

- **Cropr Module**: Auxiliary prediction heads learn to select task-relevant tokens. These heads can be removed after training, so they add almost no overhead during inference.
- **Layer-by-Layer Pruning**: Unimportant tokens are pruned gradually in the intermediate layers of the ViT, and only the most relevant tokens are passed on to deeper layers.
- **Last Layer Fusion (LLF)**: In dense tasks (such as semantic segmentation), the pruned tokens are re-introduced to recover the pruned information and preserve the accuracy of pixel-level predictions.

### Experimental Results

- **Image Classification**: Experiments with the EVA-02-L model on the ImageNet-1k dataset showed a 1.6x to 1.9x speedup while maintaining high accuracy.
- **Semantic Segmentation**: Experiments with the Segmenter model on the ADE20k dataset showed a 2x speedup relative to the no-pruning baseline, with a drop of only 0.1 median mIoU across 5 seeds.

### Main Contributions

- **Efficient Token Selection**: An end-to-end learned selection mechanism picks task-relevant tokens and avoids unnecessary computation.
- **Wide Applicability**: Cropr applies not only to image classification but also to semantic segmentation, object detection, and instance segmentation.
- **Balance between Performance and Efficiency**: Inference speed improves significantly while performance stays relatively high, which suits resource-constrained scenarios.

In conclusion, the paper addresses the problem of improving ViT inference throughput in resource-constrained applications by proposing Cropr, which maintains high performance and applies to multiple vision tasks.
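The layer-by-layer pruning and Last Layer Fusion steps described under Solutions can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`cross_attention_scores`, `prune_layerwise`, `fuse_pruned_tokens`), the random query vectors standing in for learned ones, the rule of summing attention over queries, and the keep schedule `[12, 8]` are all assumptions made for illustration. The transformer blocks themselves are omitted, and fusion is reduced to restoring pruned tokens to their original positions, since the summary does not specify how the paper fuses them.

```python
import numpy as np


def cross_attention_scores(tokens, queries):
    """Score each token by the total attention it receives from the queries.

    tokens: (n, d) array, queries: (q, d) array. Returns an (n,) array.
    """
    d = tokens.shape[-1]
    logits = queries @ tokens.T / np.sqrt(d)               # (q, n)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)         # softmax over tokens
    return attn.sum(axis=0)                                # aggregate over queries


def prune_layerwise(tokens, queries_per_layer, keep_per_layer):
    """Keep the top-scoring tokens at each layer and remember the rest."""
    positions = np.arange(tokens.shape[0])  # original position of each token
    pruned = []                             # (tokens, positions) dropped per layer
    for queries, keep in zip(queries_per_layer, keep_per_layer):
        # A real model would apply a transformer block to `tokens` here.
        scores = cross_attention_scores(tokens, queries)
        order = np.argsort(-scores, kind="stable")
        kept, dropped = np.sort(order[:keep]), np.sort(order[keep:])
        pruned.append((tokens[dropped], positions[dropped]))
        tokens, positions = tokens[kept], positions[kept]
    return tokens, positions, pruned


def fuse_pruned_tokens(kept_tokens, kept_positions, pruned, num_tokens):
    """Rebuild a full-length sequence by restoring tokens to their positions."""
    full = np.zeros((num_tokens, kept_tokens.shape[-1]), dtype=kept_tokens.dtype)
    full[kept_positions] = kept_tokens
    for dropped_tokens, dropped_positions in pruned:
        full[dropped_positions] = dropped_tokens
    return full


# Hypothetical setup: 16 tokens of width 8, two pruning layers with 2 queries each.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
layer_queries = [rng.normal(size=(2, 8)) for _ in range(2)]
kept, kept_pos, pruned = prune_layerwise(x, layer_queries, keep_per_layer=[12, 8])
fused = fuse_pruned_tokens(kept, kept_pos, pruned, num_tokens=16)
print(kept.shape, fused.shape)  # (8, 8) (16, 8)
```

Because no transformer block is applied between pruning steps in this sketch, `fused` is identical to the input `x`. In a real model the kept tokens would have been updated by the deeper layers, while the re-introduced tokens would carry the features they had when they were pruned.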