Multi-modal interaction with token division strategy for RGB-T tracking

Yujue Cai, Xiubao Sui, Guohua Gu, Qian Chen
DOI: https://doi.org/10.1016/j.patcog.2024.110626
IF: 8
2024-06-08
Pattern Recognition
Abstract: RGB-T tracking takes visible and infrared images as inputs, making it an extension of multi-modal fusion to visual object tracking. The complementarity between the visible and infrared modalities can enhance the robustness of a tracker in complex scenes. Cross-modal interaction can facilitate the fusion and synergy of different modalities, but most previous methods lack clear target information during multi-modal fusion, which leads to undesired cross-relations in the interaction. To reduce these undesired cross-relations, we propose a Multi-modal Interaction scheme Guided by Token Division strategy (MIGTD). This scheme divides the input multi-modal tokens into several categories and restricts the interaction between tokens by setting different rules for each category. These restrictions are applied in parallel through an attention masking strategy. To classify search tokens accurately, an instance segmentation task with a box-supervised loss is employed. We conduct extensive experiments on three popular benchmark datasets: RGBT234, LasHeR and VTUAV. The experimental results indicate that the proposed tracker reaches an internationally advanced level of performance.
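The token division idea can be illustrated with a small sketch of category-based attention masking. This is not the authors' implementation: the abstract does not name the token categories or the interaction rules, so the three categories (`TEMPLATE`, `TARGET`, `BACKGROUND`), the `ALLOWED` rule table, and the helper names `build_mask` and `masked_attention` are all hypothetical. The sketch also ignores the RGB/infrared split and uses a single token sequence to keep the masking mechanism visible.

```python
import math

# Hypothetical token categories (the abstract does not name them).
TEMPLATE, TARGET, BACKGROUND = 0, 1, 2

# Hypothetical rule table: ALLOWED[q][k] is True if a query token of
# category q may attend to a key token of category k.
ALLOWED = {
    TEMPLATE:   {TEMPLATE: True,  TARGET: True,  BACKGROUND: False},
    TARGET:     {TEMPLATE: True,  TARGET: True,  BACKGROUND: False},
    BACKGROUND: {TEMPLATE: False, TARGET: False, BACKGROUND: True},
}


def build_mask(categories):
    """Return an N x N boolean mask; mask[i][j] is True if token i may attend to token j."""
    return [[ALLOWED[qi][kj] for kj in categories] for qi in categories]


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention in which disallowed pairs get zero weight.

    Assumes every row of the mask allows at least one key (true for the
    rule table above, since each category may attend to itself).
    """
    d = len(q[0])
    outputs, weights = [], []
    for i, qi in enumerate(q):
        # Disallowed pairs get a score of -inf, so softmax assigns them weight 0.
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) if mask[i][j] else -math.inf
            for j, kj in enumerate(k)
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        outputs.append([sum(w[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))])
    return outputs, weights


# Four toy tokens with two feature channels each.
categories = [TEMPLATE, TARGET, BACKGROUND, TARGET]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
mask = build_mask(categories)
outputs, weights = masked_attention(tokens, tokens, tokens, mask)
```

Because the restriction is expressed as one mask over the full attention matrix, all category rules are enforced in a single attention pass rather than by running separate attention operations per category, which is the sense in which such a scheme can run in parallel.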
computer science, artificial intelligence, engineering, electrical & electronic