Abstract: Recently, some researchers have begun to adopt the Transformer to complement or replace the widely used ResNet as their backbone network, because the Transformer captures long-range relations between pixels well through self-attention, which compensates for the limited receptive field of CNNs. Although these trackers work well in regular scenarios, they simply flatten the 2D features into a sequence to match the Transformer's input format. We believe this operation ignores the spatial prior of the target object and may therefore lead to sub-optimal results. In addition, many works demonstrate that self-attention actually acts as a low-pass filter, regardless of the input features or keys/queries. That is, it may suppress the high-frequency components of the input features while preserving or even amplifying the low-frequency information. To handle these issues, we propose a unified Spatial-Frequency Transformer that models the Gaussian spatial Prior and High-frequency emphasis Attention (GPHA) simultaneously. Specifically, the Gaussian spatial prior is generated using dual Multi-Layer Perceptrons (MLPs) and injected into the similarity matrix produced by multiplying the Query and Key features in self-attention. The result is fed into a softmax layer and then decomposed into two components, i.e., the direct and high-frequency signals. The low- and high-pass branches are rescaled and combined to form an all-pass filter, so that the high-frequency features are well preserved across stacked self-attention layers. We further integrate the Spatial-Frequency Transformer into the Siamese tracking framework and propose a novel tracking algorithm termed SFTransT. The cross-scale fusion based SwinTransformer is adopted as the backbone, and a multi-head cross-attention module is used to boost the interaction between search and template features. The output is fed into the tracking head for target localization.
Extensive experiments on both short-term and long-term tracking benchmarks demonstrate the effectiveness of our proposed framework. Source code will be released at https://github.com/Tchuanm/SFTransT.git.
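
The GPHA idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the released SFTransT implementation; it only shows the two steps named above under stated assumptions. The function names (`gaussian_prior`, `gpha_attention`) and the parameters `centers`, `sigmas`, `low_scale`, and `high_scale` are hypothetical. In the paper the Gaussian parameters come from dual MLPs, whereas here they are passed in directly. The sketch also assumes that the prior is added to the scaled Query-Key similarity before the softmax, and that the direct component of the attention map is the uniform averaging matrix, with the high-frequency component being the remainder.

```python
import numpy as np


def gaussian_prior(h, w, centers, sigmas):
    """Log-Gaussian bias of shape (h*w, h*w).

    Row i holds a 2D Gaussian (in log space) over all key positions,
    centered at centers[i] = (row, col) with standard deviation sigmas[i].
    Assumption: in the paper these parameters are predicted by MLPs.
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(float)  # (n, 2)
    diff = grid[None, :, :] - centers[:, None, :]                     # (n, n, 2)
    sq_dist = (diff ** 2).sum(axis=-1)                                # (n, n)
    return -sq_dist / (2.0 * sigmas[:, None] ** 2)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gpha_attention(q, k, v, prior, low_scale=1.0, high_scale=2.0):
    """Single-head attention with a spatial prior and high-frequency emphasis.

    q, k, v: arrays of shape (n, d); prior: array of shape (n, n).
    Returns the output features, the plain attention map, and the
    rescaled attention map.
    """
    n, d = q.shape
    # Inject the spatial prior into the Query-Key similarity matrix.
    sim = q @ k.T / np.sqrt(d) + prior
    attn = softmax(sim, axis=-1)
    # Direct (low-pass) component: uniform averaging over all tokens.
    low = np.full((n, n), 1.0 / n)
    # High-frequency component: whatever the averaging filter removes.
    high = attn - low
    # Rescale both branches and recombine them.
    combined = low_scale * low + high_scale * high
    return combined @ v, attn, combined


if __name__ == "__main__":
    h, w, d = 4, 4, 8
    n = h * w
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
    # Hypothetical choice: center each query's Gaussian on its own position.
    centers = np.array([[y, x] for y in range(h) for x in range(w)], dtype=float)
    sigmas = np.full(n, 1.5)
    prior = gaussian_prior(h, w, centers, sigmas)
    out, attn, combined = gpha_attention(q, k, v, prior)
    print(out.shape, attn.shape, combined.shape)
```

With `low_scale = high_scale = 1` the combined map equals the ordinary softmax attention, so under these assumptions a `high_scale` above 1 amplifies only the part of the attention that deviates from plain averaging.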

A Spectral–Spatial Transformer Fusion Method for Hyperspectral Video Tracking

Spectral-Spatial-Temporal Attention Network for Hyperspectral Tracking

SiamHYPER: Learning a Hyperspectral Object Tracker from an RGB-Based Tracker

An Anchor-Free Siamese Target Tracking Network for Hyperspectral Video

SSTtrack: A Unified Hyperspectral Video Tracking Framework via Modeling Spectral-Spatial-Temporal Conditions

SSF-Net: Spatial-Spectral Fusion Network with Spectral Angle Awareness for Hyperspectral Object Tracking

Unsupervised Deep Hyperspectral Video Target Tracking and High Spectral-Spatial-Temporal Resolution (H³) Benchmark Dataset

Object Tracking in Hyperspectral-Oriented Video with Fast Spatial-Spectral Features

Hyperspectral Video Tracker Based on Spectral Deviation Reduction and a Double Siamese Network

Transformer-Based Band Regrouping With Feature Refinement for Hyperspectral Object Tracking

HHTrack: Hyperspectral Object Tracking Using Hybrid Attention

Hyperspectral Video Target Tracking based on Pixel-wise Spectral Matching Reduction and Deep Spectral Cascading Texture Features

Learning Spatial-Frequency Transformer for Visual Object Tracking

SENSE: Hyperspectral Video Object Tracker via Fusing Material and Motion Cues

TFTN: A Transformer-Based Fusion Tracking Framework of Hyperspectral and RGB

Hyperspectral Attention Network for Object Tracking

Learning a Deep Ensemble Network with Band Importance for Hyperspectral Object Tracking

Transformer Tracking via Frequency Fusion

Exploring reliable infrared object tracking with spatio-temporal fusion transformer

SCSTCF: Spatial-Channel Selection and Temporal Regularized Correlation Filters for visual tracking

Hierarchical Spectral–Spatial Transformer for Hyperspectral and Multispectral Image Fusion