Real-Time RGBT Target Tracking Based on Attention Mechanism

Qian Zhao, Jun Liu, Junjia Wang, Xingzhong Xiong
DOI: https://doi.org/10.3390/electronics13132517
IF: 2.9
2024-06-27
Electronics
Abstract: The fusion tracking of RGB and thermal infrared image (RGBT) has attracted widespread interest within target tracking by leveraging the complementing benefits of information from both visible and thermal infrared modalities, but achieving robustness while operating in real time remains a challenge. Aimed at this problem, this paper proposes a real-time tracking network based on the attention mechanism, which can improve the tracking speed with a smaller model, and at the same time, introduce the attention mechanism in the module to strengthen the attention to the important features, which can guarantee a certain tracking accuracy. Specifically, the modal features of visible and thermal infrared are extracted separately by using the backbone of the dual-stream structure; then, the important features in the two modes are selected and enhanced by using the channel attention mechanism in the feature selection enhancement module (FSEM) and the Transformer, while noise is reduced by using gating circuits. Finally, the final enhancement fusion is performed by using the spatial channel adaptive adjustment fusion module (SCAAM) in both the spatial and channel dimensions. The PR/SR of the proposed algorithm tested on the GTOT, RGBT234 and LasHeR datasets are 90.0%/73.0%, 84.4%/60.2%, and 46.8%/34.3%, respectively, and generally good tracking accuracy has been achieved, with a speed of up to 32.3067 fps, meeting the model's real-time requirement.
engineering, electrical & electronic; physics, applied; computer science, information systems
What problem does this paper attempt to address?
The paper addresses how to fuse RGB (visible light) and T (thermal infrared) images efficiently for real-time target tracking while preserving tracking accuracy. Specifically, it proposes a real-time tracking network based on an attention mechanism to tackle weaknesses of existing RGBT target tracking methods, such as insufficient modal information fusion, high computational complexity, and slow tracking speed. By introducing an attention mechanism, the network aims to improve tracking speed, reduce redundant information, and ensure feature hierarchy, thereby fusing modal information efficiently while maintaining a certain level of tracking accuracy.

### Main Contributions

1. **Proposed a real-time tracking network based on an attention mechanism**: This network uses the attention mechanism to enhance features, improving tracking speed while ensuring tracking accuracy. The enhanced fusion operation in the last layer reduces computational complexity and redundant information.
2. **Designed a feature selection enhancement module**: This module uses channel attention to adaptively select and fuse features learned by different convolution kernels, and combines it with a Transformer to explore rich contextual information. Useful information is enhanced and unimportant information is suppressed, which improves tracking performance.
3. **Constructed a spatial-channel adaptive adjustment fusion module**: This module adjusts and fuses the previously received information in both the spatial and channel dimensions, guiding the tracker toward better tracking results.

### Method Overview

The real-time tracking network framework based on the attention mechanism proposed in the paper consists mainly of the following parts:

- **Dual-stream structure**: The first 3 layers of VGG-M serve as the backbone network to extract features from the RGB and TIR images. The two feature extractors have the same structure but different parameters.
- **Feature selection enhancement module**: This module obtains features at different scales through convolution kernels of different sizes, and then uses the encoder and decoder of a Transformer to improve the data fusion operation, extracting and enhancing important features.
- **Spatial-channel adaptive adjustment fusion module**: This module further fuses the improved features, capturing useful features and fusing them adaptively.
- **Accurate pooling layer**: Used to accelerate feature extraction while maintaining the quality of the extracted features.
- **Fully connected layer and Softmax layer**: Used to predict the position of the target and thereby track it.

Through these steps, the paper aims to address the high computational complexity and slow tracking speed of existing RGBT target tracking methods while maintaining high tracking accuracy.
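
The selection step described for the feature selection enhancement module (channel attention choosing among features from convolution kernels of different sizes) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy feature maps, and the use of each channel's pooled mean directly as its attention logit are all assumptions made for illustration. A trained module would compute the logits with learned layers, and the Transformer and gating parts are omitted here.

```python
import math


def global_avg_pool(feat):
    """Return one mean value per channel; feat is a [C][H][W] nested list."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]


def softmax(xs):
    """Numerically stable softmax over a short list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_and_fuse(branches):
    """Fuse same-shaped branch features with per-channel softmax weights.

    Assumption: each channel's pooled mean is used directly as its logit.
    A trained module would produce the logits with learned layers instead.
    """
    descriptors = [global_avg_pool(b) for b in branches]  # shape [B][C]
    channels = len(branches[0])
    height, width = len(branches[0][0]), len(branches[0][0][0])
    weights, fused = [], []
    for c in range(channels):
        # Branches compete for each channel through a softmax over branches.
        w = softmax([d[c] for d in descriptors])
        weights.append(w)
        fused.append([
            [sum(w[b] * branches[b][c][i][j] for b in range(len(branches)))
             for j in range(width)]
            for i in range(height)
        ])
    return fused, weights


# Hypothetical 2-channel, 2x2 outputs of two branches with different kernel sizes.
branch_small = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
branch_large = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
fused, weights = select_and_fuse([branch_small, branch_large])
print(weights[0])  # equal responses give equal weights
print(weights[1])  # the stronger branch receives the larger weight
```

In this toy run the first channel responds equally in both branches, so the two are averaged, while the second channel is dominated by the branch with the stronger response. The same per-channel weighting idea could in principle be applied across the RGB and TIR streams, although the paper's fusion module also operates in the spatial dimension, which this sketch does not cover.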