Simplifying Cross-modal Interaction Via Modality-Shared Features for RGBT Tracking

Liqiu Chen, Yuqing Huang, Hengyu Li, Zikun Zhou, Zhenyu He
DOI: https://doi.org/10.1145/3664647.3681564
2024-01-01
Abstract: Thermal infrared (TIR) data exhibits higher tolerance to extreme environments, making it a valuable complement to RGB data in tracking tasks. RGBT tracking aims to leverage information from both RGB and TIR images for stable and robust tracking. However, existing RGBT tracking methods are hampered by significant modality differences and by selective emphasis on interactive information, which makes cross-modal interaction inefficient. To address these issues, we propose a novel framework, Integrating Interaction into Modality-shared Features with ViT (IIMF), a simplified cross-modal interaction network comprising modality-shared, RGB modality-specific, and TIR modality-specific branches. The modality-shared branch aggregates modality-shared information and implements inter-modal interaction. Specifically, our approach first extracts modality-shared features from RGB and TIR features with a cross-attention mechanism. We then design a Cross-Attention-based Modality-shared Information Aggregation (CAMIA) module to further aggregate modality-shared information with modality-shared tokens. We evaluate our model on three widely used benchmark datasets, and extensive experiments demonstrate that our method achieves state-of-the-art performance. The source code is released at https://github.com/Liqiu-Chen/IIMF.
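As a rough illustration of the general idea of aggregating information from two modalities into a small set of shared tokens via cross-attention, the NumPy sketch below may help. It is not the IIMF or CAMIA implementation: the single-head attention, the token counts, the feature dimension, the random projection matrices, and all names (`cross_attention`, `shared_tokens`, and so on) are assumptions made only for this example. The released repository is the reference for the actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    # Single-head scaled dot-product cross-attention:
    # queries attend to context tokens (hypothetical helper, not from the paper).
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
dim, n_rgb, n_tir, n_shared = 16, 8, 8, 4  # illustrative sizes only

# Stand-ins for RGB features, TIR features, and modality-shared tokens.
rgb_tokens = rng.standard_normal((n_rgb, dim))
tir_tokens = rng.standard_normal((n_tir, dim))
shared_tokens = rng.standard_normal((n_shared, dim))

# Random projections in place of learned weights (assumption).
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

# The shared tokens query the concatenation of both modalities, so each
# shared token becomes a weighted mixture of RGB and TIR information.
context = np.concatenate([rgb_tokens, tir_tokens], axis=0)
aggregated, weights = cross_attention(shared_tokens, context, w_q, w_k, w_v)
```

In this sketch each row of `weights` is a distribution over the 16 RGB and TIR tokens, so the split of attention mass between the first 8 columns and the last 8 shows how much each shared token draws from each modality.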