Abstract: Although red, green, and blue (RGB) images have high spatial resolution, they depict only color intensities in the RGB channels, so RGB-based trackers easily fail in challenging scenarios, for example, when the object and the background have similar colors. Hyperspectral images carry rich spectral information and are more robust in these situations, so it is essential to explore how hyperspectral features can effectively supplement RGB information in object tracking. However, no existing fusion tracking algorithm is based on hyperspectral and RGB data. We therefore propose a novel hyperspectral and RGB fusion tracking framework, termed the transformer-based fusion tracking network (TFTN), to enhance object tracking performance. Within the framework, we construct a dual-branch structure based on the Siamese network to obtain modality-specific representations of the images from each modality. The framework is generic and suits the Siamese series of tracking algorithms. In addition, we design a Siamese 3-D convolutional neural network as the hyperspectral-specific branch, which extracts the spatial and spectral features of hyperspectral data simultaneously so that the hyperspectral data contribute fully to tracking performance. In particular, inspired by the structure of the Transformer, we design a transformer-based fusion module to capture the potential interactions among intramodality and intermodality features. This is the first work that combines hyperspectral and RGB information to improve tracking performance. It is also the first to employ the Transformer's self-attention module to combine information from different modalities for multimodality fusion tracking. Experimental results on a dataset composed of hyperspectral and RGB image sequences show that the proposed TFTN tracker outperforms state-of-the-art trackers, demonstrating the effectiveness of the method.
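To make the idea of capturing intramodality and intermodality interactions with self-attention more concrete, the following is a minimal sketch, not the TFTN fusion module itself. It assumes that each branch has already produced a short list of feature tokens, concatenates the two lists, and applies single-head scaled dot-product self-attention with identity projections (no learned weights) plus a residual connection. The function names `self_attention` and `fuse`, the token values, and the token dimension are all hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a trained model would learn these projections.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    attended = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return attended, weights


def fuse(rgb_tokens, hsi_tokens):
    """Attend over the joint token sequence of both modalities.

    Because every token attends to every other token, one attention map
    covers RGB-to-RGB and HSI-to-HSI (intramodality) as well as
    RGB-to-HSI and HSI-to-RGB (intermodality) interactions.
    """
    joint = rgb_tokens + hsi_tokens
    attended, weights = self_attention(joint)
    # Residual connection: keep the original token and add the attended context.
    fused = [[x + y for x, y in zip(t, a)] for t, a in zip(joint, attended)]
    return fused, weights


# Hypothetical toy tokens: 2 from the RGB branch, 3 from the hyperspectral branch.
rgb = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
hsi = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

fused, weights = fuse(rgb, hsi)

# Split the attention of the first RGB token into its two parts.
n_rgb = len(rgb)
intra = sum(weights[0][:n_rgb])   # attention mass on RGB tokens
inter = sum(weights[0][n_rgb:])   # attention mass on hyperspectral tokens
print(f"intra-modality mass: {intra:.3f}, inter-modality mass: {inter:.3f}")
```

Concatenating the tokens before attention is only one possible design. A cross-attention variant, in which one modality supplies the queries and the other supplies the keys and values, would model the intermodality term alone.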

Learning a Multimodal Feature Transformer for RGBT Tracking

Visible and Infrared Object Tracking Based on Multimodal Hierarchical Relationship Modeling

Unified Single-Stage Transformer Network for Efficient RGB-T Tracking

Learning Modality Complementary Features with Mixed Attention Mechanism for RGB-T Tracking

Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens

Transformer-based RGB-T Tracking with Channel and Spatial Feature Fusion

Object fusion tracking for RGB-T images via channel swapping and modal mutual attention

Multi-Level Fusion for Robust RGBT Tracking via Enhanced Thermal Representation

Exploring Multi-Modal Spatial-Temporal Contexts for High-Performance RGB-T Tracking

RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning

Visible and Infrared Object Tracking via Convolution-Transformer Network With Joint Multimodal Feature Learning

RGB-T Tracking Based on Mixed Attention

Unsupervised RGB-T object tracking with attentional multi-modal feature fusion

MTNet: Learning Modality-aware Representation with Transformer for RGBT Tracking

TFTN: A Transformer-Based Fusion Tracking Framework of Hyperspectral and RGB

RGBT Image Fusion Tracking via Sparse Trifurcate Transformer Aggregation Network

RGB-T tracking by modality difference reduction and feature re-selection

Robust RGB-T Tracking via Graph Attention-Based Bilinear Pooling

Multi-modal multi-task feature fusion for RGBT tracking

CMC2R: Cross‐modal Collaborative Contextual Representation for RGBT Tracking

Dynamic Disentangled Fusion Network for RGBT Tracking