Efficient Transformer-based Single-stage RGB-T Tracking with Larger Search Region and Acceleration Mechanism

Jianqiang Xia, Xiaolei Wang, Dianxi Shi, Songchang Jin, Chenran Zhao, Linna Song, Qianying Ouyang
DOI: https://doi.org/10.1145/3663976.3664032
2024-01-01
Abstract: Most current RGB-T tracking networks based on pure Transformers are local trackers designed around small local search regions. Although existing methods for joint feature extraction, fusion, and relation modeling greatly simplify the design of RGB-T tracking networks and improve performance, these networks still perform fusion tracking within limited local search regions, which is ill-suited to challenging scenarios where objects move rapidly in the image. If the search region is too small, the region cropped from a subsequent frame may fail to contain the target, so the network can no longer track it effectively and tracking performance declines. Therefore, we first propose a network that uses a larger search region together with joint feature extraction, fusion, and relation modeling to achieve single-stage RGB-T tracking. Second, enlarging the search region may introduce more background noise, and the computational load of the attention mechanism grows quadratically, which significantly reduces the network's inference speed and affects tracking performance. We therefore propose a modality feature asynchronous elimination mechanism to accelerate joint feature extraction, fusion, and relation modeling, yielding SLATrack, an RGB-T tracking network that can accurately track targets in a larger search region. Extensive experiments on three commonly used RGB-T tracking benchmarks show that our method achieves excellent results while maintaining good real-time performance.
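The quadratic growth in attention cost mentioned in the abstract can be illustrated with a rough token count. This is a minimal sketch under assumptions that do not come from the paper: a 16-pixel patch size, a 128-pixel template crop, search crops of 256 and 384 pixels, and template and search tokens from both modalities concatenated into one sequence for self-attention. The function names are hypothetical.

```python
def num_tokens(side_px: int, patch_px: int = 16) -> int:
    """Number of non-overlapping square patches in a square crop."""
    if side_px % patch_px != 0:
        raise ValueError("crop side must be a multiple of the patch size")
    return (side_px // patch_px) ** 2


def attention_pairs(template_px: int, search_px: int,
                    patch_px: int = 16, modalities: int = 2) -> int:
    """Token pairs scored by one self-attention layer over the joint sequence.

    Assumes (hypothetically) that template and search tokens of every
    modality are concatenated into a single sequence.
    """
    per_modality = num_tokens(template_px, patch_px) + num_tokens(search_px, patch_px)
    total = modalities * per_modality
    return total * total


# Assumed crop sizes, chosen only for illustration.
small = attention_pairs(template_px=128, search_px=256)
large = attention_pairs(template_px=128, search_px=384)
ratio = large / small
print(small, large, ratio)
```

With these assumed sizes, widening the search crop by a factor of 1.5 per side doubles the sequence length and so quadruples the number of token pairs per attention layer, which is the kind of cost that motivates discarding uninformative tokens during inference.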