Abstract: In recent years, the long-range attention mechanism of vision transformers has driven significant performance breakthroughs across various computer vision tasks. However, the traditional self-attention mechanism, which processes both informative and non-informative tokens, suffers from inefficiency and inaccuracies. While sparse attention mechanisms have been introduced to mitigate these issues by pruning tokens involved in attention, they often lack context-awareness and intelligence. These mechanisms frequently apply a uniform token selection strategy across different inputs for batch training or optimize efficiency only for the inference stage. To overcome these challenges, we propose a novel algorithm: Select and Pack Attention (SPA). SPA dynamically selects informative tokens using a low-cost gating layer supervised by selection labels and packs these tokens into new batches, enabling a variable number of tokens to be used in parallelized GPU batch training and inference. Extensive experiments across diverse datasets and computer vision tasks demonstrate that SPA delivers superior performance and efficiency, including a 0.6 mAP improvement in object detection and a 16.4% reduction in computational costs.

What problem does this paper attempt to address?

This paper addresses the efficiency and performance problems that Vision Transformers (ViTs) face in computer vision tasks. Specifically:

1. **Efficiency problems**:
   - Traditional self-attention processes all tokens, informative and uninformative alike, which wastes compute.
   - Existing sparse attention mechanisms reduce computation by pruning tokens, but they lack context-awareness and usually optimize efficiency only for the inference stage, so they fail to fully exploit GPU parallelism.
2. **Performance problems**:
   - Existing methods perform well on simple tasks (such as image classification), but their performance degrades on complex tasks (such as object detection), which require richer semantic information.
   - Dynamic token selection can lose information, especially in the early stages.

To solve these problems, the authors propose a new algorithm, Select and Pack Attention (SPA). Its main features are:

- **Dynamic selection of informative tokens**: A low-cost linear gating layer selects informative tokens under the guidance of supervision labels.
- **Token packing for parallel training**: The selected tokens are packed into new batches, so that a different number of tokens per input can be trained and inferred in parallel on the GPU.
- **Multi-scale supervision**: Multi-scale selection labels (for example, multi-scale supervision signals derived from object labels) guide token selection and keep it accurate, which improves performance on complex tasks.

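The select and pack steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `select_tokens` and `pack_tokens`, the fixed 0.5 threshold, and the owner-id bookkeeping are all hypothetical, and the hard threshold stands in for the supervised, differentiable selection the paper describes.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def select_tokens(tokens, weights, bias=0.0, threshold=0.5):
    """Score each token with a linear gate and keep those above the threshold.

    Assumption: a hard threshold replaces the paper's supervised,
    differentiable selection; this only shows the data flow.
    """
    kept = []
    for tok in tokens:
        score = sigmoid(sum(w * x for w, x in zip(weights, tok)) + bias)
        if score > threshold:
            # Scale the kept token by its gate score (soft gating).
            kept.append([score * x for x in tok])
    return kept


def pack_tokens(selected_per_image, pack_len):
    """Concatenate kept tokens from all images and cut them into packs.

    Returns (packs, owners). owners holds the source image index of each
    slot, or -1 for padding, so that an attention mask could later stop
    tokens of different images from attending to each other.
    """
    flat = [(img_id, tok)
            for img_id, image in enumerate(selected_per_image)
            for tok in image]
    if not flat:
        return [], []
    dim = len(flat[0][1])
    packs, owners = [], []
    for start in range(0, len(flat), pack_len):
        chunk = flat[start:start + pack_len]
        pad = pack_len - len(chunk)
        packs.append([tok for _, tok in chunk] + [[0.0] * dim for _ in range(pad)])
        owners.append([img_id for img_id, _ in chunk] + [-1] * pad)
    return packs, owners


# Three images with 3, 1, and 1 one-dimensional tokens each.
images = [[[1.0], [-1.0], [2.0]], [[3.0]], [[-2.0]]]
selected = [select_tokens(img, weights=[1.0]) for img in images]
packs, owners = pack_tokens(selected, pack_len=2)
print([len(s) for s in selected])  # kept tokens per image
print(owners)                      # which image each packed slot came from
```

The point of the sketch is that images keep different numbers of tokens, yet every pack has the same length, which is what lets a variable token count run as a regular GPU batch.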
Through these improvements, SPA both improves ViT performance across computer vision tasks (such as object detection, multi-label classification, and image classification) and significantly reduces computational cost. Experimental results show that SPA outperforms existing methods on multiple datasets and performs well on sparse data.

### Formula summary

1. **Selection and packing process**:

   \[ s = \text{Max}(f_{\theta_g}(r), \text{sup}) \]

   \[ r_g = \text{Sigmoid}(s) \odot r \]

   \[ r_p = \text{Gumbel-Softmax}(s) \odot r_g \]

2. **Loss function**:

   \[ L_{\text{select}} = -\sum_{\text{block}} \left( y \log s + (1 - y) \log(1 - s) \right) \]

   \[ L_{\text{SPT}} = L_{\text{task}} + \alpha L_{\text{select}} \]

3. **Complexity comparison**:

   \[ \Omega(\text{MSA}) = B(4NC^2 + 2N^2C) \]

   \[ \Omega(\text{W-MSA}) = B(4NC^2 + 2M^2NC) \]

   \[ \Omega(\text{SPA}) = B(NC + NC^2) + B'(3LC^2 + 2L^2C) \]

These formulas describe how SPA selects and packs tokens, and show its computational complexity advantage over traditional methods.
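To make the complexity comparison concrete, the three expressions can be evaluated directly. The sketch below assumes the usual reading of the symbols, which the summary does not spell out: `B` is the batch size, `N` the number of tokens per image, `C` the channel dimension, `M` the window size, `B_packed` (the formula's B') the number of packed batches, and `L` the number of tokens per pack. The input values are toy numbers chosen for illustration, not figures from the paper.

```python
def msa_flops(B, N, C):
    """Global multi-head self-attention: B(4NC^2 + 2N^2C)."""
    return B * (4 * N * C**2 + 2 * N**2 * C)


def wmsa_flops(B, N, C, M):
    """Window attention with window size M: B(4NC^2 + 2M^2NC)."""
    return B * (4 * N * C**2 + 2 * M**2 * N * C)


def spa_flops(B, N, C, B_packed, L):
    """SPA: gating over all tokens plus attention over the packed tokens.

    B(NC + NC^2) + B'(3LC^2 + 2L^2C), with B' written as B_packed.
    """
    return B * (N * C + N * C**2) + B_packed * (3 * L * C**2 + 2 * L**2 * C)


# Toy setting (assumed values): 2 images of 16 tokens with 4 channels,
# window size 2, and half of the tokens kept, packed as 2 packs of 8.
B, N, C, M = 2, 16, 4, 2
B_packed, L = 2, 8

print("MSA  :", msa_flops(B, N, C))
print("W-MSA:", wmsa_flops(B, N, C, M))
print("SPA  :", spa_flops(B, N, C, B_packed, L))
```

In this reading, the SPA cost falls as fewer tokens are selected, because the attention term depends on the packed quantities `B_packed` and `L` rather than on `N`.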

Context-Aware Token Selection and Packing for Enhanced Vision Transformer
