Abstract: Recently, several trackers utilising the Transformer architecture have shown significant performance improvements. However, the high computational cost of multi‐head attention, a core component of the Transformer, limits real‐time running speed, which is crucial for tracking tasks. Additionally, the global mechanism of multi‐head attention makes it susceptible to distractors that carry semantic information similar to the target. To address these issues, the authors propose a novel adaptive attention that enhances features through a spatially sparse attention mechanism with less than 1/4 of the computational complexity of multi‐head attention. The adaptive attention sets a perception range around each element in the feature map based on the target scale in the previous tracking result and adaptively searches for the information of interest. This allows the module to focus on the target region rather than background distractors. Based on adaptive attention, the authors build an efficient transformer tracking framework. It performs deep interaction between search and template features to activate target information and aggregates multi‐level interaction features to enhance the representation ability. The evaluation results on seven benchmarks show that the authors' tracker achieves outstanding performance at a speed of 43 fps, with significant advantages in hard circumstances.
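The idea of a perception range around each feature-map element can be illustrated with a minimal NumPy sketch. This is not the authors' adaptive attention: it is plain single-head attention restricted to a fixed rectangular window, and it omits the adaptive search for information of interest. The function names, the `scale` factor, and the rule that derives the window radius from the previous target size are all assumptions made for illustration. The dense boolean mask only shows the sparsity pattern; it does not by itself reduce computation.

```python
import numpy as np


def radius_from_target(target_h, target_w, scale=0.5):
    """Hypothetical rule: window radius as a fraction of the previous
    target size, measured in feature-map cells (at least 1)."""
    ry = max(1, int(np.ceil(scale * target_h)))
    rx = max(1, int(np.ceil(scale * target_w)))
    return ry, rx


def local_window_mask(height, width, radius_y, radius_x):
    """Boolean (H*W, H*W) mask: True where a key position lies inside
    the rectangular window centred on the query position."""
    ys, xs = np.divmod(np.arange(height * width), width)
    dy = np.abs(ys[:, None] - ys[None, :])
    dx = np.abs(xs[:, None] - xs[None, :])
    return (dy <= radius_y) & (dx <= radius_x)


def windowed_attention(q, k, v, mask):
    """Single-head scaled dot-product attention over masked pairs only."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Every query sees at least itself, so each row has a finite maximum.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    h, w, dim = 16, 16, 32
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(h * w, dim)) for _ in range(3))
    ry, rx = radius_from_target(target_h=4, target_w=6)
    mask = local_window_mask(h, w, ry, rx)
    out = windowed_attention(q, k, v, mask)
    print(out.shape, f"fraction of query-key pairs kept: {mask.mean():.3f}")
```

In this sketch a smaller previous target gives a smaller window, so fewer query-key pairs are kept. An efficient implementation would gather only the keys inside each window instead of masking a dense score matrix.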

ASAFormer: Visual tracking with convolutional vision transformer and asymmetric selective attention

Constituent Attention for Vision Transformers

Vision Transformer with Super Token Sampling

Visual tracking using transformer with a combination of convolution and attention

Adaptive sparse attention-based compact transformer for object tracking

ScalableViT: Rethinking the Context-Oriented Generalization of Vision Transformer

ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

Vision Transformer with Sparse Scan Prior

AViTMP: A Tracking-Specific Transformer for Single-Branch Visual Tracking

EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

Efficient transformer tracking with adaptive attention

How Does Attention Work in Vision Transformers? A Visual Analytics Attempt

Learning Cross-Attention Discriminators via Alternating Time–Space Transformers for Visual Tracking

ReViT: Enhancing Vision Transformers Feature Diversity with Attention Residual Connections

Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention

VSA: Learning Varied-Size Window Attention in Vision Transformers

DctViT: Discrete Cosine Transform Meet Vision Transformers

ACC-ViT : Atrous Convolution's Comeback in Vision Transformers

RegionViT: Regional-to-Local Attention for Vision Transformers

Dual-Dependency Attention Transformer for Fine-Grained Visual Classification

Lite Vision Transformer with Enhanced Self-Attention