Abstract: Fusion tracking of RGB and thermal infrared (RGBT) images has attracted widespread interest in target tracking because it exploits the complementary strengths of the visible and thermal infrared modalities, but achieving robustness while operating in real time remains a challenge. To address this problem, this paper proposes a real-time tracking network based on the attention mechanism. A smaller model improves tracking speed, while attention inside the modules strengthens the focus on important features so that tracking accuracy is maintained. Specifically, visible and thermal infrared features are extracted separately by a dual-stream backbone. The important features in the two modalities are then selected and enhanced by the channel attention mechanism in the feature selection enhancement module (FSEM) and by the Transformer, while noise is reduced by using gating circuits. Finally, the spatial channel adaptive adjustment fusion module (SCAAM) performs the final enhancement and fusion in both the spatial and channel dimensions. On the GTOT, RGBT234 and LasHeR datasets, the proposed algorithm achieves PR/SR of 90.0%/73.0%, 84.4%/60.2%, and 46.8%/34.3%, respectively, which is generally good tracking accuracy, at a speed of up to 32.3067 fps, which meets the real-time requirement.
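The PR/SR figures quoted in the abstract are the precision rate and success rate commonly reported on RGBT tracking benchmarks. The sketch below shows one usual way to compute them and is not taken from the paper: it assumes boxes in `(x, y, w, h)` format, a center-error threshold passed as a parameter (20 pixels is a common default, and some benchmarks use a smaller value), and a success rate taken as the mean of the success curve over 21 evenly spaced overlap thresholds. All function names are hypothetical.

```python
from math import hypot


def center_error(pred, gt):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    px, py = pred[0] + pred[2] / 2, pred[1] + pred[3] / 2
    gx, gy = gt[0] + gt[2] / 2, gt[1] + gt[3] / 2
    return hypot(px - gx, py - gy)


def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def precision_rate(preds, gts, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    hits = sum(center_error(p, g) <= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


def success_rate(preds, gts, steps=21):
    """Mean of the success curve over `steps` overlap thresholds in [0, 1]."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    # Assumed convention: a frame counts as a success when IoU > threshold.
    curve = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(curve) / len(curve)
```

Given per-frame predictions and ground-truth boxes for a sequence, `precision_rate(preds, gts)` and `success_rate(preds, gts)` return values in the range 0 to 1; benchmark toolkits typically average these over all sequences and report them as percentages.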

What problem does this paper attempt to address?

The paper addresses how to fuse RGB (visible light) and T (thermal infrared) images efficiently for real-time target tracking while preserving tracking accuracy. Specifically, it proposes a real-time tracking network based on an attention mechanism to tackle the shortcomings of existing RGBT tracking methods: insufficient fusion of modal information, high computational complexity, and slow tracking speed. By introducing an attention mechanism, the network aims to improve tracking speed, reduce redundant information, and preserve feature hierarchy, so that modal information is fused efficiently while a certain level of tracking accuracy is maintained.

### Main Contributions

1. **Proposed a real-time tracking network based on an attention mechanism**: This network uses the attention mechanism to enhance features, improving tracking speed while preserving tracking accuracy. Performing the enhanced fusion operation only in the last layer reduces computational complexity and redundant information.
2. **Designed a feature selection enhancement module**: This module uses channel attention to adaptively select and fuse features learned by different convolution kernels, and combines this with a Transformer to explore rich contextual information. Useful information is enhanced and unimportant information is suppressed, which improves tracking performance.
3. **Constructed a spatial-channel adaptive adjustment fusion module**: This module adjusts and fuses the previously received information in the spatial and channel dimensions, guiding the tracker toward better tracking results.

### Method Overview

The real-time tracking framework based on the attention mechanism proposed in the paper consists mainly of the following parts:

- **Dual-stream structure**: The first 3 layers of VGG-M serve as the backbone network to extract features from the RGB and TIR images. The two feature extractors have the same structure, but their parameters are different.
- **Feature selection enhancement module**: This module obtains features at different scales through convolution kernels of different sizes, and then uses the encoder and decoder of a Transformer to improve the data fusion operation, extracting and enhancing important features.
- **Spatial-channel adaptive adjustment fusion module**: This module further fuses the enhanced features, capturing useful features and fusing them adaptively.
- **Accurate pooling layer**: Used to accelerate feature extraction while maintaining the quality of the extracted features.
- **Fully connected layer and Softmax layer**: Used to predict the position of the target, which completes the tracking step.

Through these steps, the paper aims to reduce the high computational complexity and slow tracking speed of existing RGBT target tracking methods while maintaining high tracking accuracy.
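The channel-attention selection described for the feature selection enhancement module can be illustrated with a minimal, parameter-free sketch. This is not the paper's implementation: it assumes the outputs of the different-sized convolution kernels are already available as same-shaped branches (nested lists indexed as `[channel][row][col]`), and it replaces the learned layers a real module would use with a plain softmax over each channel's globally averaged response. The names `global_avg_pool`, `softmax`, and `select_and_fuse` are hypothetical.

```python
from math import exp


def global_avg_pool(fmap):
    """Reduce a [C][H][W] feature map to one mean value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select_and_fuse(branches):
    """Fuse same-shaped branches using per-channel softmax weights.

    Assumption: a learned module would pass the pooled descriptors through
    fully connected layers first; that step is omitted in this sketch.
    Returns (fused [C][H][W] map, per-channel list of branch weights).
    """
    descriptors = [global_avg_pool(b) for b in branches]
    fused, weights = [], []
    for c in range(len(branches[0])):
        # One weight per branch for this channel; the weights sum to 1.
        w = softmax([d[c] for d in descriptors])
        weights.append(w)
        rows = len(branches[0][c])
        cols = len(branches[0][c][0])
        fused.append([
            [sum(w[k] * branches[k][c][i][j] for k in range(len(branches)))
             for j in range(cols)]
            for i in range(rows)
        ])
    return fused, weights
```

For each channel, the branch with the stronger average response receives the larger weight, so the fused map leans toward the kernel size that responds most on that channel. In a trained network the weights would come from learned parameters rather than directly from the raw averages.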

Real-Time RGBT Target Tracking Based on Attention Mechanism

RGB-T Tracking Based on Mixed Attention

SiamCAF: Complementary Attention Fusion-Based Siamese Network for RGBT Tracking

Unsupervised RGB-T object tracking with attentional multi-modal feature fusion

Learning Modality Complementary Features with Mixed Attention Mechanism for RGB-T Tracking

Unified Single-Stage Transformer Network for Efficient RGB-T Tracking

Temporal Adaptive RGBT Tracking with Modality Prompt

RGB-T Tracking with Template-Bridged Search Interaction and Target-Preserved Template Updating

AMATrack: A Unified Network With Asymmetric Multimodal Mixed Attention for RGBD Tracking

Multi-Scale Feature Interactive Fusion Network for RGBT Tracking

Transformer-based RGB-T Tracking with Channel and Spatial Feature Fusion

QueryTrack: Joint-Modality Query Fusion Network for RGBT Tracking

Dynamic Fusion Network for RGBT Tracking

Object fusion tracking for RGB-T images via channel swapping and modal mutual attention

Multi-Level Fusion for Robust RGBT Tracking via Enhanced Thermal Representation

Robust RGB-T Tracking via Graph Attention-Based Bilinear Pooling

MIRNet: A Robust RGBT Tracking Jointly with Multi-Modal Interaction and Refinement

Learning a Multimodal Feature Transformer for RGBT Tracking

Visible and Infrared Object Tracking Based on Multimodal Hierarchical Relationship Modeling

Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens

RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning