Abstract: Many state-of-the-art RGB-T trackers have achieved remarkable results through modality fusion. However, these trackers often either overlook temporal information or fail to fully utilize it, resulting in an ineffective balance between multi-modal and temporal information. To address this issue, we propose a novel Cross Fusion RGB-T Tracking architecture (CFBT) that ensures the full participation of multiple modalities in tracking while dynamically fusing temporal information. The effectiveness of CFBT relies on three newly designed cross spatio-temporal information fusion modules: Cross Spatio-Temporal Augmentation Fusion (CSTAF), Cross Spatio-Temporal Complementarity Fusion (CSTCF), and Dual-Stream Spatio-Temporal Adapter (DSTA). CSTAF employs a cross-attention mechanism to enhance the feature representation of the template comprehensively. CSTCF utilizes complementary information between different branches to enhance target features and suppress background features. DSTA adopts the adapter concept to adaptively fuse complementary information from multiple branches within the transformer layer, using the RGB modality as a medium. These fusions from multiple perspectives add less than 0.3% of the total model parameters, yet they enable an efficient balance between multi-modal and temporal information. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate that our method achieves new state-of-the-art performance.

What problem does this paper attempt to address?

This paper addresses the shortcomings of existing RGB-T (visible light and thermal infrared) trackers in handling multimodal and temporal information. Specifically:

1. **Multimodal information fusion problem**: Although existing RGB-T trackers have achieved remarkable results through modality fusion, they often overlook temporal information or fail to make full use of it, so they balance multimodal and temporal information poorly.
2. **Robustness problem in complex scenes**: In complex scenes, the target may be occluded or deformed, which can cause tracking to fail. Relying only on the initial template cannot handle these situations, so temporal information is needed to assist tracking.
3. **Parameter efficiency problem**: A temporal information fusion module usually adds a large number of parameters. This is particularly unfavorable for RGB-T tracking because RGB-T datasets are far smaller than RGB datasets, which makes it difficult to train a high-performing model.

To solve these problems, the authors propose a new cross-fusion RGB-T tracking architecture (CFBT), which ensures that multiple modalities are fully involved in the tracking process and dynamically fuses temporal information. Its key innovations are three newly designed cross spatio-temporal information fusion modules: Cross Spatio-Temporal Augmentation Fusion (CSTAF), Cross Spatio-Temporal Complementarity Fusion (CSTCF), and Dual-Stream Spatio-Temporal Adapter (DSTA). These modules aim to combine multimodal and temporal information effectively, thereby improving tracking accuracy and robustness.
### Specific problem description

- **Insufficient use of multimodal information**: Although existing RGB-T trackers can use multimodal information, they ignore the time dimension during fusion.
- **Inadequate use of temporal information**: Many RGB-T trackers that do handle temporal information focus only on the interaction between templates and fail to make full use of the temporal information in different branches.
- **Low parameter efficiency**: A complex temporal information fusion module significantly increases the number of parameters, and RGB-T datasets are too small to support training that many parameters.

### Solution

The CFBT architecture proposed by the authors addresses these problems in the following ways:

1. **Cross Spatio-Temporal Augmentation Fusion (CSTAF)**: Uses a cross-attention mechanism to comprehensively enhance the feature representation of the template, so that the template is more focused on the target object.
2. **Cross Spatio-Temporal Complementarity Fusion (CSTCF)**: Uses the complementary information between different branches to enhance target features and suppress background features, thereby improving tracking accuracy.
3. **Dual-Stream Spatio-Temporal Adapter (DSTA)**: Adopts the adapter concept to adaptively fuse complementary information from multiple branches within the transformer layer, using the RGB modality as a medium to achieve efficient temporal information fusion.

Through these innovations, CFBT balances multimodal and temporal information with few additional parameters and achieves state-of-the-art performance on multiple RGB-T tracking benchmarks.
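
The two generic building blocks named in the solution, cross-attention and a bottleneck adapter, can be sketched in a few lines of NumPy. This is only an illustration of the general techniques, not the authors' CSTAF or DSTA: the token counts, feature width, bottleneck size, weight initialisation, and all function and variable names are assumptions chosen for the example, and the summary above does not specify how the real modules are wired.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: queries come from one branch,
    keys and values from another. Returns the residual-updated queries
    and the attention weights."""
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return query_tokens + weights @ v, weights


def bottleneck_adapter(tokens, w_down, w_up):
    """Down-project, apply ReLU, up-project. The result is meant to be
    added to the tokens of another stream."""
    return np.maximum(tokens @ w_down, 0.0) @ w_up


rng = np.random.default_rng(0)
dim, bottleneck, n_tokens = 64, 8, 16  # hypothetical sizes

# Hypothetical token sets: an initial template and a later (dynamic) template.
initial_template = rng.standard_normal((n_tokens, dim))
dynamic_template = rng.standard_normal((n_tokens, dim))
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) * 0.05 for _ in range(3))

# Enhance the initial template with information from the dynamic one.
enhanced_template, attn = cross_attention(
    initial_template, dynamic_template, w_q, w_k, w_v
)

# Inject thermal-branch tokens into the RGB stream through a small adapter.
w_down = rng.standard_normal((dim, bottleneck)) * 0.05
w_up = np.zeros((bottleneck, dim))  # zero init: the adapter starts as a no-op
tir_tokens = rng.standard_normal((n_tokens, dim))
rgb_tokens = enhanced_template + bottleneck_adapter(tir_tokens, w_down, w_up)

# Parameter counts for this toy configuration only.
adapter_params = w_down.size + w_up.size
attention_params = w_q.size + w_k.size + w_v.size
print(adapter_params, attention_params)
```

In this toy configuration the adapter has far fewer weights than the attention projections because it routes features through a narrow bottleneck, which is the usual reason adapters are chosen when training data is limited. The counts here say nothing about the parameter budget reported for CFBT.
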
### Conclusion

By introducing novel cross spatio-temporal information fusion modules, CFBT addresses the shortcomings of existing RGB-T trackers in fusing multimodal and temporal information and improves tracking robustness and accuracy.

Cross Fusion RGB-T Tracking with Bi-directional Adapter

Transformer-based RGB-T Tracking with Channel and Spatial Feature Fusion

Object fusion tracking for RGB-T images via channel swapping and modal mutual attention

QueryTrack: Joint-Modality Query Fusion Network for RGBT Tracking

Cross-modulated Attention Transformer for RGBT Tracking

Exploring fusion strategies for accurate RGBT visual object tracking

RGBT Image Fusion Tracking via Sparse Trifurcate Transformer Aggregation Network

RGB-T Tracking Based on Mixed Attention

Multi-Stage Fusion for Event-based Multimodal Tracker

Unified Single-Stage Transformer Network for Efficient RGB-T Tracking

Temporal Adaptive RGBT Tracking with Modality Prompt

Unsupervised RGB-T object tracking with attentional multi-modal feature fusion

RGBT tracking via cross-modality message passing

RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning

Learning Modality Complementary Features with Mixed Attention Mechanism for RGB-T Tracking

TFTN: A Transformer-Based Fusion Tracking Framework of Hyperspectral and RGB

Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens

Learning Dual-Fused Modality-Aware Representations for RGBD Tracking

AFter: Attention-based Fusion Router for RGBT Tracking

RGB-T Tracking with Template-Bridged Search Interaction and Target-Preserved Template Updating

Multi-modal multi-task feature fusion for RGBT tracking