Abstract: How to better fuse cross-modal features is the core issue in RGB-T tracking. Some previous methods either fuse RGB and TIR features insufficiently or depend on intermediaries containing information from both modalities to achieve cross-modal interaction. The former do not fully exploit the potential of using only the RGB and TIR information of the template or search region for channel and spatial feature fusion; the latter lack direct interaction between the template and search area, which limits the model's ability to fully exploit the original semantic information of both modalities. To alleviate these limitations, we explore how to improve the performance of a visual Transformer through direct fusion of cross-modal channel and spatial features, and propose CSTNet. CSTNet uses ViT as its backbone and inserts cross-modal channel feature fusion modules (CFM) and cross-modal spatial feature fusion modules (SFM) for direct interaction between RGB and TIR features. The CFM performs joint channel enhancement and joint multilevel spatial feature modeling of RGB and TIR features in parallel, sums the resulting features, and then globally integrates the summed feature with the original features. The SFM uses cross-attention to model the spatial relationship of cross-modal features and then introduces a convolutional feedforward network for joint spatial and channel integration of multimodal features. We then remove the CFM and SFM and retrain the resulting model using CSTNet as the pre-training weights, yielding CSTNet-small, which achieves a 36% reduction in parameters, a 24% reduction in FLOPs, and a 50% speedup with a 1-2% performance decrease. Comprehensive experiments show that CSTNet achieves state-of-the-art performance on three public RGB-T tracking benchmarks. Code is available at https://github.com/LiYunfengLYF/CSTNet.
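
The cross-attention step that the abstract attributes to the SFM can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, omits the learned query/key/value projections and the convolutional feedforward network, and adds a residual connection as an illustrative choice. The function name `cross_attention` and the toy token values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head cross-attention without learned projections (a simplification).

    queries:     tokens of one modality (e.g. RGB), a list of d-dim vectors
    keys_values: tokens of the other modality (e.g. TIR), a list of d-dim vectors
    Returns one fused d-dim vector per query token.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for q in queries:
        # Scaled dot-product similarity between this token and every token
        # of the other modality.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Weighted sum of the other modality's tokens.
        attended = [
            sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)
        ]
        # Residual connection (an assumption made for this sketch).
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused


# Toy tokens: 3 RGB tokens and 2 TIR tokens, each 2-dimensional.
rgb_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tir_tokens = [[0.5, 0.5], [1.0, 0.0]]

# Each modality queries the other, so information flows in both directions.
rgb_fused = cross_attention(rgb_tokens, tir_tokens)
tir_fused = cross_attention(tir_tokens, rgb_tokens)
```

In this sketch each modality's tokens attend directly to the other modality's tokens, which is the kind of direct RGB-TIR interaction the abstract contrasts with intermediary-based fusion.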

RGBT Tracking by Fully-Convolutional Triple Networks with Cosine Embedding Loss

Unidirectional Cross-Modal Fusion for RGB-T Tracking

RGB-T Tracking Based on Mixed Attention

Robust RGB-T Tracking via Graph Attention-Based Bilinear Pooling

Learning Modality Feature Fusion Via Transformer for RGBT-tracking

Learning Modality Complementary Features with Mixed Attention Mechanism for RGB-T Tracking

CIRNet: an Improved RGBT Tracking Via Cross-Modality Interaction and Re-Identification

Learning a Multimodal Feature Transformer for RGBT Tracking

Special attribute-based cross-modal interactive fusion network for RGBT tracking

SiamCAF: Complementary Attention Fusion-Based Siamese Network for RGBT Tracking

Unsupervised RGB-T object tracking with attentional multi-modal feature fusion

Learning Multi-Layer Attention Aggregation Siamese Network for Robust RGBT Tracking

Transformer-based RGB-T Tracking with Channel and Spatial Feature Fusion

Learning Reliable Modal Weight with Transformer for Robust RGBT Tracking

Residual Learning-Based Two-Stream Network for RGB-T Object Tracking

Unified Single-Stage Transformer Network for Efficient RGB-T Tracking

Dual-Modality Feature Extraction Network Based on Graph Attention for RGBT Tracking

CMC2R: Cross‐modal Collaborative Contextual Representation for RGBT Tracking

Dual-Modality Space-Time Memory Network for RGBT Tracking

RGBT Image Fusion Tracking via Sparse Trifurcate Transformer Aggregation Network

RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning