Abstract: Recently, several trackers utilising the Transformer architecture have shown significant performance improvements. However, the high computational cost of multi‐head attention, a core component of the Transformer, limits real‐time running speed, which is crucial for tracking tasks. Additionally, the global mechanism of multi‐head attention makes it susceptible to distractors whose semantic information is similar to that of the target. To address these issues, the authors propose a novel adaptive attention that enhances features through a spatially sparse attention mechanism with less than 1/4 of the computational complexity of multi‐head attention. The adaptive attention sets a perception range around each element in the feature map based on the target scale in the previous tracking result and adaptively searches for the information of interest. This allows the module to focus on the target region rather than background distractors. Based on adaptive attention, the authors build an efficient transformer tracking framework. It performs deep interaction between search and template features to activate target information and aggregates multi‐level interaction features to enhance the representation ability. The evaluation results on seven benchmarks show that the authors' tracker achieves outstanding performance at a speed of 43 fps, with significant advantages in hard circumstances.
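
The idea of restricting each feature-map element to a perception range tied to the previous target scale can be illustrated with a minimal sketch. The code below is not the authors' module: it substitutes plain fixed-window local attention for the adaptive search described in the abstract, and the function names, the `stride` and `scale` parameters, and the rule mapping target size to radius are all assumptions made for illustration. It only shows why limiting the attended region reduces the number of query-key pairs compared with global attention.

```python
import math


def perception_radius(target_w, target_h, stride=16, scale=0.5):
    """Hypothetical rule: derive a window radius (in feature cells) from the
    previous target size in pixels. `stride` and `scale` are assumed values."""
    cells = max(target_w, target_h) / stride
    return max(1, math.ceil(scale * cells))


def local_attention(feat, radius):
    """Self-attention where each position attends only to positions within
    `radius` cells (a square window). `feat` is an H x W grid of C-dim vectors
    stored as nested lists. Queries, keys and values are the raw features
    (no learned projections), to keep the sketch short."""
    H, W, C = len(feat), len(feat[0]), len(feat[0][0])
    out = [[None] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            q = feat[i][j]
            # Keys/values come only from the window around (i, j).
            neigh = [feat[y][x]
                     for y in range(max(0, i - radius), min(H, i + radius + 1))
                     for x in range(max(0, j - radius), min(W, j + radius + 1))]
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(C) for k in neigh]
            m = max(scores)  # subtract the max for a numerically stable softmax
            w = [math.exp(s - m) for s in scores]
            z = sum(w)
            out[i][j] = [sum(wi * k[c] for wi, k in zip(w, neigh)) / z
                         for c in range(C)]
    return out


def pair_counts(H, W, radius):
    """Number of query-key pairs for windowed attention vs. global attention."""
    local = sum((min(H, i + radius + 1) - max(0, i - radius)) *
                (min(W, j + radius + 1) - max(0, j - radius))
                for i in range(H) for j in range(W))
    return local, (H * W) ** 2


if __name__ == "__main__":
    r = perception_radius(64, 48)      # previous target of 64 x 48 pixels
    local, full = pair_counts(16, 16, r)
    print(r, local, full)              # radius, windowed pairs, global pairs
```

With these assumed settings, a 16 x 16 feature map and a 64 x 48 pixel target give a radius of 2, so each query sees at most 25 keys instead of 256. The ratio obtained here depends entirely on the assumed window rule and should not be read as the complexity figure reported in the abstract.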

Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking

Efficient Visual Tracking via Hierarchical Cross-Attention Transformer

LiteTrack: Layer Pruning with Asynchronous Feature Extraction for Lightweight and Efficient Visual Tracking

Target-aware transformer tracking with hard occlusion instance generation

Exploring Dynamic Transformer for Efficient Object Tracking

Scaling-Invariant Max-Filtering Enhancement Transformers for Efficient Visual Tracking

Siamese Hierarchical Feature Fusion Transformer for Efficient Tracking

Exploiting Temporal Coherence for Self-Supervised Visual Tracking by Using Vision Transformer

Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking

Robust Visual Tracking Using Hierarchical Vision Transformer with Shifted Windows Multi-Head Self-Attention

Compact Transformer Tracker with Correlative Masked Modeling

VTT: Long-term Visual Tracking with Transformers

Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking

Mobile Vision Transformer-based Visual Object Tracking

Propagating Prior Information with Transformer for Robust Visual Object Tracking

AViTMP: A Tracking-Specific Transformer for Single-Branch Visual Tracking

Highly Compact Adaptive Network Based on Transformer for RGBT Tracking

Efficient transformer tracking with adaptive attention

RTSformer: A Robust Toroidal Transformer With Spatiotemporal Features for Visual Tracking

LGTrack: Exploiting Local and Global Properties for Robust Visual Tracking

Adaptive and Background-Aware Vision Transformer for Real-Time UAV Tracking