Learning a Multimodal Feature Transformer for RGBT Tracking

Huiwei Shi, Xiaodong Mu, Danyao Shen, Chengliang Zhong
DOI: https://doi.org/10.1007/s11760-024-03148-7
2024-01-01
Abstract: RGB-thermal (RGBT) tracking exploits complementary multimodal information to achieve robust all-day, all-weather visual tracking, especially in challenging environments with drastic illumination changes, adverse weather conditions, and background clutter. Despite significant progress in this field, some existing dual-stream RGBT tracking methods tend to suppress low-quality or low-contribution modal features during the fusion phase, which limits further improvements in tracking performance. To address this limitation, this paper proposes a novel dual-stream hierarchical transformer fusion network with an enhanced capacity to use local and global discriminative information from both the RGB and thermal modalities. Our approach incorporates a multimodal feature transformer encoder enriched with modulation layers that adaptively extract modality-specific features. This adaptive fusion process combines both low-quality and high-quality modal information, enhancing the network's ability to represent the modal features in both the RGB and thermal branches. Additionally, we use dynamic anchor boxes and denoising-based training to accelerate training of the dual-stream transformer. Comprehensive experiments on RGBT datasets show that the proposed method outperforms state-of-the-art tracking methods in challenging tracking scenarios.