Abstract: Although red, green, and blue (RGB) images have high spatial resolution, they depict color intensities in only three channels, so RGB-based trackers easily fail in challenging scenarios, for example, when the object and the background are similar in color. Hyperspectral images, with their rich spectral information, are more robust in these difficult situations, so it is essential to explore how hyperspectral features can effectively supplement RGB information in object tracking. However, there is no fusion tracking algorithm based on hyperspectral and RGB data. To address this, we propose a novel hyperspectral and RGB fusion tracking framework, termed the transformer-based fusion tracking network (TFTN), to enhance object tracking performance. Within the framework, we construct a dual-branch structure based on the Siamese network to obtain modality-specific representations of the two image modalities. The framework is generic and is suitable for the Siamese family of tracking algorithms. In addition, we design a Siamese 3-D convolutional neural network as the hyperspectral branch, which extracts the spatial and spectral features of hyperspectral data simultaneously so that the hyperspectral data contribute fully to tracking performance. In particular, inspired by the structure of the Transformer, we design a transformer-based fusion module to capture the potential interactions among intramodality and intermodality features. This is the first work that combines information from the hyperspectral and RGB modalities to improve tracking performance. It is also the first to employ the self-attention module of the Transformer to combine information from different modalities for multimodality fusion tracking. Experimental results on a dataset composed of hyperspectral and RGB image sequences show that the proposed TFTN tracker outperforms state-of-the-art trackers, demonstrating the effectiveness of the method.
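
To make the idea of attention-based fusion concrete, the following is a minimal sketch of how self-attention over the tokens of two modalities can expose both intramodality and intermodality interactions. It is not the TFTN fusion module: the abstract does not give the module's layout, so the concatenation of tokens, the single attention head, the identity query/key/value projections, and the names `self_attention`, `fuse_modalities`, `rgb_tokens`, and `hsi_tokens` are all assumptions made for illustration. The feature values are toy numbers, not real data.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a trained module would learn projections.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    # Each output token is a weighted sum of all input tokens.
    fused = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return fused, weights


def fuse_modalities(rgb_tokens, hsi_tokens):
    """Hypothetical fusion step: attend over the concatenated token list.

    With n RGB tokens followed by m hyperspectral tokens, the attention
    matrix has intramodality blocks (RGB-RGB, HSI-HSI) on the diagonal
    and intermodality blocks (RGB-HSI, HSI-RGB) off the diagonal.
    """
    return self_attention(rgb_tokens + hsi_tokens)


# Toy features: 2 RGB tokens and 3 hyperspectral tokens, 4 dimensions each.
rgb_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
hsi_tokens = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
fused, weights = fuse_modalities(rgb_tokens, hsi_tokens)
```

In this sketch, `weights[i][j]` with `i < 2` and `j >= 2` is an intermodality weight: it says how strongly an RGB token draws on a hyperspectral token. The first RGB token attends more to the first hyperspectral token, which it resembles, than to the second RGB token, which is orthogonal to it.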

Unsupervised RGB-T object tracking with attentional multi-modal feature fusion

RGB-T Tracking Based on Mixed Attention

Object fusion tracking for RGB-T images via channel swapping and modal mutual attention

Learning Modality Complementary Features with Mixed Attention Mechanism for RGB-T Tracking

Exploring fusion strategies for accurate RGBT visual object tracking

Multi-modal multi-task feature fusion for RGBT tracking

Learning a Multimodal Feature Transformer for RGBT Tracking

Unified Single-Stage Transformer Network for Efficient RGB-T Tracking

Real-Time RGBT Target Tracking Based on Attention Mechanism

Multi-Level Fusion for Robust RGBT Tracking via Enhanced Thermal Representation

Robust RGB-T Tracking via Graph Attention-Based Bilinear Pooling

Jointly Modeling Motion and Appearance Cues for Robust RGB-T Tracking

QueryTrack: Joint-Modality Query Fusion Network for RGBT Tracking

AMATrack: A Unified Network With Asymmetric Multimodal Mixed Attention for RGBD Tracking

Multi-Stage Fusion for Event-based Multimodal Tracker

Learning Dual-Fused Modality-Aware Representations for RGBD Tracking

TFTN: A Transformer-Based Fusion Tracking Framework of Hyperspectral and RGB

Feature enhancement and coarse-to-fine detection for RGB-D tracking

Cross Fusion RGB-T Tracking with Bi-directional Adapter

SiamCAF: Complementary Attention Fusion-Based Siamese Network for RGBT Tracking

RGB-T Object Detection via Group Shuffled Multi-receptive Attention and Multi-modal Supervision