Context-Aware Token Selection and Packing for Enhanced Vision Transformer

Tianyi Zhang, Baoxin Li, Jae-sun Seo, Yu Cao
2024-10-31
Abstract: In recent years, the long-range attention mechanism of vision transformers has driven significant performance breakthroughs across various computer vision tasks. However, the traditional self-attention mechanism, which processes both informative and non-informative tokens, suffers from inefficiency and inaccuracies. While sparse attention mechanisms have been introduced to mitigate these issues by pruning tokens involved in attention, they often lack context-awareness and intelligence. These mechanisms frequently apply a uniform token selection strategy across different inputs for batch training or optimize efficiency only for the inference stage. To overcome these challenges, we propose a novel algorithm: Select and Pack Attention (SPA). SPA dynamically selects informative tokens using a low-cost gating layer supervised by selection labels and packs these tokens into new batches, enabling a variable number of tokens to be used in parallelized GPU batch training and inference. Extensive experiments across diverse datasets and computer vision tasks demonstrate that SPA delivers superior performance and efficiency, including a 0.6 mAP improvement in object detection and a 16.4% reduction in computational costs.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the efficiency and performance problems that Vision Transformers (ViTs) face in computer vision tasks. Specifically:

1. **Efficiency problems**:
   - Traditional self-attention processes all tokens, informative and uninformative alike, which wastes computing resources.
   - Existing sparse attention mechanisms reduce computation by pruning tokens, but they lack context-awareness and usually optimize efficiency only for the inference stage, so they do not fully exploit GPU parallelism during batch training.
2. **Performance problems**:
   - Existing methods perform well on simple tasks (such as image classification), but their performance degrades on complex tasks (such as object detection) that require richer semantic information.
   - Dynamically selecting tokens can cause information loss, especially in the early stages.

To solve these problems, the authors propose a new algorithm, Select and Pack Attention (SPA). Its main features are:

- **Dynamic selection of informative tokens**: A low-cost linear gating layer, guided by supervision labels, selects the informative tokens.
- **Token packing for parallel training**: The selected tokens are packed into new batches, so that a different number of tokens per input can still be trained and inferred in parallel on the GPU.
- **Multi-scale supervision**: Multi-scale selection labels (for example, supervision signals derived from object labels) guide token selection and keep it accurate, which improves performance on complex tasks.

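
The selection and packing steps described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the `owner` bookkeeping array, and the same-image attention mask are hypothetical, and a hard threshold on the sigmoid score stands in for the Gumbel-Softmax selection used in the paper. The gate weights would normally be learned; here they are just passed in.

```python
import numpy as np

def select_and_pack(tokens, gate_w, gate_b, pack_len, threshold=0.5):
    """Select informative tokens per image and pack them into fixed-length rows.

    tokens:   (B, N, C) array of token features
    gate_w:   (C,) weights of a hypothetical linear gating layer
    gate_b:   scalar bias of that layer
    pack_len: L, the number of token slots per packed row
    Returns (packed, owner): packed has shape (B', L, C); owner has shape
    (B', L) and holds the source image index of each slot, or -1 for padding.
    """
    B, N, C = tokens.shape
    scores = 1.0 / (1.0 + np.exp(-(tokens @ gate_w + gate_b)))  # (B, N)
    # Assumption: a hard threshold replaces Gumbel-Softmax for simplicity.
    keep = scores > threshold

    # Scale tokens by their gate score, then gather kept tokens across the batch.
    gated = tokens * scores[..., None]
    selected = gated[keep]              # (K, C), row-major order over (B, N)
    image_id = np.nonzero(keep)[0]      # (K,) source image of each kept token

    K = selected.shape[0]
    n_rows = max(1, -(-K // pack_len))  # ceil(K / L), at least one row
    packed = np.zeros((n_rows * pack_len, C), dtype=tokens.dtype)
    owner = np.full(n_rows * pack_len, -1, dtype=np.int64)
    packed[:K] = selected
    owner[:K] = image_id
    return packed.reshape(n_rows, pack_len, C), owner.reshape(n_rows, pack_len)

def attention_mask(owner):
    """Boolean (B', L, L) mask: slot i may attend to slot j only if both are
    real tokens that came from the same source image (an assumed design)."""
    same = owner[:, :, None] == owner[:, None, :]
    real = owner >= 0
    return same & real[:, :, None] & real[:, None, :]

# Example: 4 images, 16 tokens each, 8 channels, packed into rows of 8 slots.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 16, 8))
packed, owner = select_and_pack(tokens, rng.normal(size=8), 0.0, pack_len=8)
mask = attention_mask(owner)
print(packed.shape, owner.shape, mask.shape)
```

Because each image keeps a different number of tokens, the packed batch size B' depends on the input rather than on B. Padding slots attend to nothing under this mask, so an attention layer built on it would need to handle fully masked rows.
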
Through these improvements, SPA both improves ViT performance across computer vision tasks (object detection, multi-label classification, and image classification) and reduces computational cost; the abstract reports a 0.6 mAP improvement in object detection and a 16.4% reduction in computational cost. Experimental results show that SPA outperforms existing methods on multiple datasets and performs well on sparse data.

### Formula summary

1. **Selection and packing process**:
   \[ s = \text{Max}(f_{\theta_g}(r), \text{sup}) \]
   \[ r_g = \text{Sigmoid}(s) \odot r \]
   \[ r_p = \text{Gumbel-Softmax}(s) \odot r_g \]
2. **Loss function**:
   \[ L_{\text{select}} = -\sum_{\text{block}} \left( y \log s + (1 - y) \log(1 - s) \right) \]
   \[ L_{\text{SPT}} = L_{\text{task}} + \alpha L_{\text{select}} \]
3. **Complexity comparison**:
   \[ \Omega(\text{MSA}) = B(4NC^2 + 2N^2C) \]
   \[ \Omega(\text{W-MSA}) = B(4NC^2 + 2M^2NC) \]
   \[ \Omega(\text{SPA}) = B(NC + NC^2) + B'(3LC^2 + 2L^2C) \]

These formulas describe how SPA selects and packs tokens, and show its computational complexity advantage over traditional methods.
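
To get a feel for the complexity comparison, the three expressions can be evaluated directly. The sketch below assumes the usual reading of the symbols, which this summary does not define: B is the batch size, N the number of tokens per image, C the channel dimension, M the window side length, B' the number of packed rows, and L the number of tokens per packed row. The numeric values and the `packed_rows` helper are illustrative assumptions, not figures from the paper.

```python
import math

def cost_msa(B, N, C):
    """Omega(MSA) = B(4NC^2 + 2N^2C)."""
    return B * (4 * N * C**2 + 2 * N**2 * C)

def cost_wmsa(B, N, C, M):
    """Omega(W-MSA) = B(4NC^2 + 2M^2NC)."""
    return B * (4 * N * C**2 + 2 * M**2 * N * C)

def cost_spa(B, N, C, B_packed, L):
    """Omega(SPA) = B(NC + NC^2) + B'(3LC^2 + 2L^2C)."""
    return B * (N * C + N * C**2) + B_packed * (3 * L * C**2 + 2 * L**2 * C)

def packed_rows(B, N, keep_ratio, L):
    """Hypothetical helper: rows needed if a fraction keep_ratio of all
    B*N tokens is selected and packed L tokens to a row."""
    return math.ceil(B * N * keep_ratio / L)

# Illustrative setting: 56x56 tokens, 96 channels, 7x7 windows, half the tokens kept.
B, N, C, M = 8, 56 * 56, 96, 7
L = M * M
B_packed = packed_rows(B, N, keep_ratio=0.5, L=L)

msa = cost_msa(B, N, C)
wmsa = cost_wmsa(B, N, C, M)
spa = cost_spa(B, N, C, B_packed, L)
print(f"MSA:   {msa:.3e}")
print(f"W-MSA: {wmsa:.3e}")
print(f"SPA:   {spa:.3e} ({spa / wmsa:.2f}x of W-MSA)")
```

The second SPA term scales with B' rather than B, so under this reading the saving grows as fewer tokens are selected, while the first term (the gating layer and one projection over all tokens) is a fixed overhead.
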