Abstract: Vision transformers have demonstrated remarkable success in a wide range of computer vision tasks in recent years. However, their high computational cost remains a significant barrier to practical deployment. In particular, the complexity of transformer models is quadratic in the number of input tokens. Techniques that reduce the number of input tokens to be processed have therefore been proposed. This paper introduces Learned Thresholds token Merging and Pruning (LTMP), a novel approach that leverages the strengths of both token merging and token pruning. LTMP uses learned threshold masking modules that dynamically determine which tokens to merge and which to prune. We demonstrate our approach with extensive experiments on vision transformers on the ImageNet classification task. Our results show that LTMP achieves state-of-the-art accuracy across reduction rates while requiring only a single fine-tuning epoch, which is an order of magnitude faster than previous methods. Code is available at https://github.com/Mxbonn/ltmp.

What problem does this paper attempt to address?

The paper addresses the problem of reducing the computational cost of Vision Transformers while maintaining high accuracy. Vision Transformers have achieved significant success in a range of computer vision tasks, but their high computational cost is a major obstacle to practical deployment. In particular, the complexity of transformer models is quadratic in the number of input tokens. The paper therefore proposes a new method, Learned Thresholds token Merging and Pruning (LTMP), which combines the advantages of token merging and token pruning. LTMP uses learned threshold masking modules to dynamically determine which tokens to merge and which to prune. The reported experiments on ImageNet classification show that LTMP achieves state-of-the-art accuracy across reduction rates and requires only a single fine-tuning epoch, which is an order of magnitude faster than previous methods. According to these results, LTMP reduces computational cost substantially while keeping accuracy competitive, which could make Vision Transformers more practical on resource-constrained devices.
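The two ideas in the summary above, threshold-based token selection and quadratic attention cost, can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The function names, the importance scores, the fixed threshold value, and the sigmoid relaxation are all assumptions chosen for illustration; in LTMP the thresholds are learned during fine-tuning rather than set by hand.

```python
import math


def hard_threshold_mask(scores, threshold):
    """Keep (1) tokens whose score exceeds the threshold, drop (0) the rest."""
    return [1 if s > threshold else 0 for s in scores]


def soft_threshold_mask(scores, threshold, temperature=0.1):
    """Assumed differentiable relaxation of the hard mask:
    sigmoid((score - threshold) / temperature).
    A smooth mask like this is one common way to let gradients reach a threshold."""
    return [1.0 / (1.0 + math.exp(-(s - threshold) / temperature)) for s in scores]


def attention_cost(num_tokens, dim):
    """Rough multiply-accumulate count for Q @ K^T plus attention @ V: 2 * N^2 * d."""
    return 2 * num_tokens ** 2 * dim


# Hypothetical per-token importance scores (illustrative values only).
importance = [0.9, 0.05, 0.4, 0.02, 0.7, 0.3]
prune_threshold = 0.1  # would be a learned parameter; fixed here for the sketch

keep = hard_threshold_mask(importance, prune_threshold)
kept_tokens = sum(keep)

# Because attention cost is quadratic in the token count, keeping 4 of 6 tokens
# leaves (4 / 6) ** 2 of the original attention cost.
ratio = attention_cost(kept_tokens, 64) / attention_cost(len(importance), 64)

print(keep)            # [1, 0, 1, 0, 1, 1]
print(f"{ratio:.2f}")  # 0.44
```

The same masking idea could be applied to pairwise token similarity scores to decide which tokens to merge instead of prune; that variant is omitted here to keep the sketch short.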

Learned Thresholds Token Merging and Pruning for Vision Transformers