Abstract: Although red, green, and blue (RGB) images have high spatial resolution, they depict color intensities in only three channels, so RGB-based trackers easily fail in challenging scenarios, for example, when the object and the background have similar colors. Hyperspectral images, with their rich spectral information, are more robust in these situations, so it is essential to explore how hyperspectral features can effectively supplement RGB information in object tracking. However, no fusion tracking algorithm based on hyperspectral and RGB data exists so far. To address this gap, we propose a novel hyperspectral and RGB fusion tracking framework, termed the transformer-based fusion tracking network (TFTN), to enhance object tracking performance. Within the framework, we construct a dual-branch structure based on the Siamese network to obtain modality-specific representations of the images from each modality. The framework is generic and suits the Siamese family of tracking algorithms. In addition, we design a Siamese 3-D convolutional neural network as the hyperspectral-specific branch, which extracts the spatial and spectral features of hyperspectral data simultaneously so that the hyperspectral data contribute fully to tracking performance. In particular, inspired by the structure of the Transformer, we design a transformer-based fusion module to capture the potential interactions among intramodality and intermodality features. This is the first work that combines hyperspectral and RGB information to improve tracking performance. It is also the first to employ the self-attention module of the Transformer to combine information from different modalities for multimodality fusion tracking. Experimental results on a dataset composed of hyperspectral and RGB image sequences show that the proposed TFTN tracker outperforms state-of-the-art trackers, demonstrating the effectiveness of the method.
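
The fusion idea described in the abstract, self-attention applied across tokens from both modalities, can be illustrated with a minimal sketch. This is not the TFTN implementation: the function names (`self_attention`, `fuse_modalities`), the toy token values, and the use of identity query/key/value projections in place of learned weights are all assumptions made for illustration. The sketch only shows how attention over a joint token sequence lets each token weigh features from its own modality (intramodality) and from the other modality (intermodality).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of feature vectors.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a real module would learn these projections.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for query in tokens:
        # Similarity of this token to every token in the joint sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        attn = softmax(scores)
        weights.append(attn)
        # Attention-weighted sum of all tokens (the values).
        fused.append(
            [sum(a * value[d] for a, value in zip(attn, tokens)) for d in range(dim)]
        )
    return fused, weights


def fuse_modalities(rgb_tokens, hsi_tokens):
    """Hypothetical fusion step: attend over the concatenated token sequence."""
    joint = rgb_tokens + hsi_tokens
    fused, weights = self_attention(joint)
    n_rgb = len(rgb_tokens)
    # Split the fused sequence back into per-modality token sets.
    return fused[:n_rgb], fused[n_rgb:], weights


# Toy features standing in for the outputs of the two branches.
rgb = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
hsi = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
fused_rgb, fused_hsi, attn = fuse_modalities(rgb, hsi)
```

In this toy input, the first RGB token is most similar to the first hyperspectral token, so it places more attention on that cross-modality token than on the second RGB token. A practical module would add learned projections, multiple heads, and positional information, none of which are specified in the abstract.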

MTNet: Learning Modality-aware Representation with Transformer for RGBT Tracking

Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens

Transformer Union Convolution Network for Visual Object Tracking

Unified Single-Stage Transformer Network for Efficient RGB-T Tracking

Transformer-based RGB-T Tracking with Channel and Spatial Feature Fusion

Learning a Multimodal Feature Transformer for RGBT Tracking

Exploring Multi-Modal Spatial-Temporal Contexts for High-Performance RGB-T Tracking

Visible and Infrared Object Tracking Based on Multimodal Hierarchical Relationship Modeling

RGBT Image Fusion Tracking via Sparse Trifurcate Transformer Aggregation Network

MIRNet: A Robust RGBT Tracking Jointly with Multi-Modal Interaction and Refinement

Visible and Infrared Object Tracking via Convolution-Transformer Network With Joint Multimodal Feature Learning

RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning

Siamese transformer RGBT tracking

Learning Modality Complementary Features with Mixed Attention Mechanism for RGB-T Tracking

X Modality Assisting RGBT Object Tracking

CMC2R: Cross‐modal Collaborative Contextual Representation for RGBT Tracking

RGB-T Tracking Based on Mixed Attention

Cross-modulated Attention Transformer for RGBT Tracking

SiamMGT: robust RGBT tracking via graph attention and reliable modality weight learning

Temporal Adaptive RGBT Tracking with Modality Prompt

TFTN: A Transformer-Based Fusion Tracking Framework of Hyperspectral and RGB