Visible and Infrared Object Tracking via Convolution-Transformer Network With Joint Multimodal Feature Learning

Jiazhu Qiu, Rui Yao, Yong Zhou, Peng Wang, Yanning Zhang, Hancheng Zhu
DOI: https://doi.org/10.1109/lgrs.2023.3259583
IF: 5.343
2023-04-01
IEEE Geoscience and Remote Sensing Letters
Abstract: Existing Transformer-based red-green-blue-thermal (RGBT) trackers mainly focus on enhancing features extracted by a convolutional neural network (CNN), leaving the potential of the Transformer in representation learning underexplored. In this letter, we propose a Convolution-Transformer network with joint multimodal feature learning (JMFL), in which both representation learning and feature fusion leverage the Transformer. Specifically, we use a multibranch Convolution-Transformer feature extraction network to extract local modality-independent features and global modality-shared features, respectively. The Transformer backbone network is formed from several simplified Transformer encoder layers, which makes it more suitable for real-time object tracking. We also find that intermodality correlation is an important factor for modality interaction and mutual exploitation. We therefore propose a JMFL module, which uses cross-attention to capture cross-modal dependencies and enhances multimodal fusion through bidirectional guidance of multimodal information. The proposed method is evaluated extensively on two large benchmark datasets and compared with several current well-performing methods. The experimental results show that the proposed method performs well in terms of both tracking accuracy and speed.
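The bidirectional cross-attention idea named in the abstract can be illustrated with a minimal sketch. This is not the authors' JMFL module: the function names, the single-head formulation, the residual addition, the final concatenation, and the random weights are all assumptions made for illustration, and the code uses only the Python standard library so that it runs without dependencies. Each modality queries the other, so visible features are guided by thermal features and vice versa.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: queries from one modality, keys/values from the other."""
    q = matmul(query_tokens, w_q)
    k = matmul(context_tokens, w_k)
    v = matmul(context_tokens, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights

def add(a, b):
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def bidirectional_fusion(rgb, tir, params):
    """Hypothetical fusion: each modality attends to the other, then features are concatenated."""
    rgb_from_tir, _ = cross_attention(rgb, tir, *params["rgb_queries_tir"])
    tir_from_rgb, _ = cross_attention(tir, rgb, *params["tir_queries_rgb"])
    rgb_enh = add(rgb, rgb_from_tir)  # assumed residual connection
    tir_enh = add(tir, tir_from_rgb)  # assumed residual connection
    # Assumed fusion step: concatenate the two enhanced token features channel-wise.
    return [r + t for r, t in zip(rgb_enh, tir_enh)]

def random_matrix(rng, rows, cols, scale=1.0):
    """Matrix of Gaussian samples, standing in for real features or learned weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, dim = 6, 8  # toy sizes, not taken from the paper
rgb = random_matrix(rng, n_tokens, dim)
tir = random_matrix(rng, n_tokens, dim)
params = {
    name: tuple(random_matrix(rng, dim, dim, 1 / math.sqrt(dim)) for _ in range(3))
    for name in ("rgb_queries_tir", "tir_queries_rgb")
}
fused = bidirectional_fusion(rgb, tir, params)
print(len(fused), len(fused[0]))  # 6 16
```

Using separate projection weights for each direction keeps the two guidance paths independent; whether the paper shares or separates them is not stated in the abstract.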
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics