Abstract: Existing RGBT tracking methods often design various interaction models to perform cross-modal fusion at each layer, but cannot perform feature interactions among all layers, which play a critical role in robust multimodal representation, because of the large computational burden. To address this issue, this paper presents a novel All-layer multimodal Interaction Network, named AINet, which performs efficient and effective feature interactions of all modalities and layers in a progressive fusion Mamba, for robust RGBT tracking. Even though modality features in different layers are known to contain different cues, building multimodal interactions in each layer remains challenging because it is hard to balance interaction capability and efficiency. Meanwhile, considering that the feature discrepancy between RGB and thermal modalities reflects their complementary information to some extent, we design a Difference-based Fusion Mamba (DFM) to achieve enhanced fusion of different modalities with linear complexity. When interacting with features from all layers, a long token sequence (3840 tokens in this work) is involved, and the computational burden is thus large. To handle this problem, we design an Order-dynamic Fusion Mamba (OFM) to execute efficient and effective feature interactions of all layers by dynamically adjusting the scan order of different layers in Mamba. Extensive experiments on four public RGBT tracking datasets show that AINet achieves leading performance against existing state-of-the-art methods.
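To make the cost argument concrete, the following back-of-the-envelope sketch compares how the operation count grows for dense all-to-all token mixing (quadratic in sequence length, as in standard self-attention) versus a single sequential scan (linear in sequence length) over the 3840 tokens quoted in the abstract. The function names are illustrative assumptions, and the counts ignore feature dimension and constant factors; they are not measurements of AINet.

```python
def pairwise_interactions(num_tokens: int) -> int:
    """Token pairs scored by dense all-to-all mixing (quadratic in length)."""
    return num_tokens * num_tokens


def scan_steps(num_tokens: int) -> int:
    """State updates made by one sequential scan (linear in length)."""
    return num_tokens


NUM_TOKENS = 3840  # token count quoted in the abstract

pairs = pairwise_interactions(NUM_TOKENS)
steps = scan_steps(NUM_TOKENS)

print(f"pairwise: {pairs:,}")          # pairwise: 14,745,600
print(f"scan:     {steps:,}")          # scan:     3,840
print(f"ratio:    {pairs // steps}x")  # ratio:    3840x
```

The gap between the two counts grows with the sequence length itself, which is why a linear-complexity scan becomes attractive once features from every layer are concatenated into one sequence.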

What problem does this paper attempt to address?

This paper addresses a limitation of existing RGBT tracking methods: although they design various interaction models for cross-modal fusion at individual layers, they cannot perform feature interaction across all layers, which plays a crucial role in robust multimodal representation, because the computational burden is too large. Specifically, existing methods have limited capacity for feature interaction between layers and struggle to balance interaction capability against efficiency. The burden is especially heavy when the token sequence gathered from all layers is long (3840 tokens in this paper). In addition, since the feature differences between the RGB and thermal infrared modalities reflect their complementary information, using that complementary information efficiently is itself a challenge.

To address these problems, the paper proposes a novel All-layer multimodal Interaction Network (AINet), which efficiently and effectively performs feature interactions of all modalities and layers in a Progressive Fusion Mamba, thereby achieving robust RGBT tracking. The main contributions of AINet are:

1. **Proposing a new all-layer multimodal interaction network**: The network realizes multimodal interaction within each layer and also interaction across all layers through the Progressive Fusion Mamba. As far as the authors know, this is the first time that the Mamba network has been introduced into RGBT tracking.
2. **Designing a Difference-based Fusion Mamba (DFM)**: By modeling modality differences to capture complementary information, DFM achieves modality-enhanced fusion and can be applied efficiently at each layer.
3. **Designing an Order-dynamic Fusion Mamba (OFM)**: Through an input-aware dynamic scanning scheme, OFM realizes the interaction of all-layer features and alleviates the forgetting of information from early input tokens.
4. **Extensive experimental verification**: Experiments on four publicly available RGBT tracking benchmark datasets show that AINet outperforms existing state-of-the-art methods in terms of both performance and efficiency.

Through these innovations, AINet reduces the computational burden that existing RGBT tracking methods face in multi-layer feature interaction, while improving the robustness and accuracy of tracking.
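The forgetting effect that motivates a dynamic scan order can be illustrated with a toy linear recurrence. This is only a sketch under stated assumptions: a scalar state with a fixed decay stands in for Mamba's input-dependent state-space update, each layer contributes a single token, and the per-layer `relevance` scores and the rule "scan the most relevant layer last" are hypothetical, not the criterion OFM actually uses.

```python
from typing import Dict, List

DECAY = 0.5  # assumed fixed decay; real Mamba uses learned, input-dependent dynamics


def scan(tokens: List[float], decay: float = DECAY) -> float:
    """Run h_t = decay * h_{t-1} + x_t over the sequence; return the final state."""
    state = 0.0
    for x in tokens:
        state = decay * state + x
    return state


def final_weights(num_tokens: int, decay: float = DECAY) -> List[float]:
    """Weight of each position in the final state: decay ** (steps remaining)."""
    return [decay ** (num_tokens - 1 - t) for t in range(num_tokens)]


def order_by_relevance(relevance: Dict[str, float]) -> List[str]:
    """Hypothetical ordering rule: least relevant layer first, most relevant last."""
    return sorted(relevance, key=lambda name: relevance[name])


# One toy token per layer; values and relevance scores are made up for illustration.
layer_tokens = {"layer1": 1.0, "layer2": 1.0, "layer3": 1.0}
relevance = {"layer1": 0.9, "layer2": 0.2, "layer3": 0.5}

fixed_order = ["layer1", "layer2", "layer3"]
dynamic_order = order_by_relevance(relevance)

weights = final_weights(len(fixed_order))
fixed_weight = dict(zip(fixed_order, weights))
dynamic_weight = dict(zip(dynamic_order, weights))
final_state = scan([layer_tokens[name] for name in fixed_order])

print(weights)                   # [0.25, 0.5, 1.0]
print(final_state)               # 1.75
print(dynamic_order)             # ['layer2', 'layer3', 'layer1']
print(fixed_weight["layer1"])    # 0.25
print(dynamic_weight["layer1"])  # 1.0
```

In the fixed order, the first layer's token is attenuated to a quarter of its value by the time the scan ends, while reordering lets the layer deemed most relevant keep its full weight. An input-aware ordering would choose the order per input rather than from fixed scores as done here.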

RGBT Tracking via All-layer Multimodal Interactions with Progressive Fusion Mamba
