Abstract: Recently, some researchers have begun to adopt the Transformer to complement or replace the widely used ResNet as their backbone network, because its self-attention scheme captures long-range relations between pixels well and thereby compensates for the limited receptive field of CNNs. Although these trackers work well in regular scenarios, they simply flatten the 2D features into a sequence to match the input format of the Transformer. We believe this operation ignores the spatial prior of the target object, which may lead to sub-optimal results. In addition, many works demonstrate that self-attention acts as a low-pass filter regardless of the input features or keys/queries. That is to say, it may suppress the high-frequency components of the input features while preserving or even amplifying the low-frequency information. To handle these issues, we propose a unified Spatial-Frequency Transformer that models the Gaussian spatial Prior and High-frequency emphasis Attention (GPHA) simultaneously. Specifically, the Gaussian spatial prior is generated using dual Multi-Layer Perceptrons (MLPs) and injected into the similarity matrix produced by multiplying the Query and Key features in self-attention. The output is fed into a softmax layer and then decomposed into two components, i.e., the direct and high-frequency signals. The low- and high-pass branches are rescaled and combined to achieve an all-pass filter, so the high-frequency features are well preserved in stacked self-attention layers. We further integrate the Spatial-Frequency Transformer into the Siamese tracking framework and propose a novel tracking algorithm termed SFTransT. A cross-scale fusion based SwinTransformer is adopted as the backbone, and a multi-head cross-attention module is used to boost the interaction between search and template features. The output is fed into the tracking head for target localization. Extensive experiments on both short-term and long-term tracking benchmarks demonstrate the effectiveness of our proposed framework. Source code will be released at https://github.com/Tchuanm/SFTransT.git.
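
The attention step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the released SFTransT implementation, and several details are assumptions made for illustration only: the prior is a fixed pairwise log-Gaussian with a hand-set `sigma` (the abstract says it is generated by dual MLPs), it is added to the scaled Query-Key similarity (the abstract only says "injected"), and the direct component is taken to be the uniform averaging matrix with entries 1/n, with everything else treated as high-frequency. The names `gaussian_prior`, `gpha_attention`, `low_scale`, and `high_scale` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_prior(height, width, sigma):
    """Pairwise log-Gaussian bias between positions of a height x width grid.

    Positions are flattened row-major, so entry [i][j] depends on the 2D
    distance between tokens i and j. A fixed sigma is an assumption; the
    abstract states that the prior is produced by dual MLPs.
    """
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [-((ri - rj) ** 2 + (ci - cj) ** 2) / (2.0 * sigma ** 2)
         for (rj, cj) in coords]
        for (ri, ci) in coords
    ]


def gpha_attention(q, k, v, prior, low_scale=1.0, high_scale=2.0):
    """Self-attention with an additive spatial prior and rescaled branches.

    q, k, v are lists of n token vectors; prior is an n x n bias matrix.
    Returns (attn, mixed, out): the plain softmax attention, the rescaled
    attention, and the attended values.
    """
    n, d = len(q), len(q[0])
    # Scaled Query-Key similarity with the prior injected additively (assumed).
    logits = [
        [sum(q[i][t] * k[j][t] for t in range(d)) / math.sqrt(d) + prior[i][j]
         for j in range(n)]
        for i in range(n)
    ]
    attn = [softmax(row) for row in logits]
    # Direct (low-pass) part: uniform averaging, every entry equal to 1/n.
    # High-pass part: whatever remains of the attention matrix.
    low = 1.0 / n
    mixed = [
        [low_scale * low + high_scale * (a - low) for a in row]
        for row in attn
    ]
    out = [
        [sum(mixed[i][j] * v[j][t] for j in range(n)) for t in range(len(v[0]))]
        for i in range(n)
    ]
    return attn, mixed, out


if __name__ == "__main__":
    # Four tokens from a 2 x 2 feature map, each with two channels.
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    bias = gaussian_prior(2, 2, sigma=1.0)
    _, mixed, out = gpha_attention(tokens, tokens, tokens, bias)
    for row in out:
        print([round(x, 3) for x in row])
```

With `low_scale = high_scale = 1` the two branches sum back to the ordinary softmax attention. Raising `high_scale` above 1 amplifies the deviation from uniform averaging, which is one plausible reading of "high-frequency emphasis"; in the paper the rescaling is presumably learned rather than hand-set.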

Learning Spatial-Channel Attention for Visual Tracking

Background-aware Siamese Network Tracking Based on Salient Feature Fusion

Learning reinforced attentional representation for end-to-end visual tracking

Exploiting spatial relationships for visual tracking

Robust visual tracking with channel attention and focal loss

Continuity-Discrimination Convolutional Neural Network for Visual Object Tracking

SCSTCF: Spatial-Channel Selection and Temporal Regularized Correlation Filters for visual tracking

Learning Spatial-Frequency Transformer for Visual Object Tracking

CMAT: Integrating Convolution Mixer and Self-Attention for Visual Tracking

UCT: Learning Unified Convolutional Networks for Real-time Visual Tracking

Collaboration of Spatial and Feature Attention for Visual Tracking

Dynamic memory network with spatial-temporal feature fusion for visual tracking

High Performance Visual Object Tracking with Unified Convolutional Networks

Learning Multidimensional Spatial Attention for Robust Nighttime Visual Tracking

Joint Correlation and Attention Based Feature Fusion Network for Accurate Visual Tracking

Learning background-aware and spatial-temporal regularized correlation filters for visual tracking

Siamese anchor-free object tracking with multiscale spatial attentions

Exploiting multi-scale hierarchical feature representation for visual tracking

Siamese Attentional Cascade Keypoints Network for Visual Object Tracking

Unsupervised RGB-T object tracking with attentional multi-modal feature fusion

Target-Aware Deep Tracking