MTAtrack: Multilevel Transformer Attention for Visual Tracking

Dong An, Fan Zhang, Yuqian Zhao, Biao Luo, Chunhua Yang, Baifan Chen, Lingli Yu
DOI: https://doi.org/10.1016/j.optlastec.2023.109659
2023-01-01
Abstract: Compared with traditional trackers based on convolutional neural networks, Transformer tracking algorithms achieve state-of-the-art performance owing to their ability to capture long-range global interactions among image features. Current Transformer-based trackers exploit only the highest-level attention features from the last encoder layer, and so lose some of the target boundary information carried by the attention features of earlier layers. In this work, we propose an effective multilevel Transformer attention tracking algorithm, named MTAtrack, for predicting the target position. In a Transformer-like module, the hierarchical global feature dependencies between target templates and search regions are modeled by a self-attention mechanism and fused by a mutual attention mechanism to learn a robust representation of the target position. Additionally, the low-level attention features are converted into spatial weights to further strengthen the boundary details of the targets. Comprehensive experiments show that our algorithm enriches the feature description of target appearance and improves the robustness of target localization. The proposed tracker performs favorably against state-of-the-art tracking algorithms on five challenging datasets, especially on the large-scale GOT-10k, TrackingNet, and LaSOT benchmarks, in terms of accuracy and efficiency. The source code is available at https://github.com/adda1221/MTAtrack.
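The three ingredients named in the abstract (self-attention within the template and the search region, mutual attention between them, and spatial weights derived from lower-level features) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `attention`, `spatial_weights`, and `fuse` are hypothetical, there are no learned projections or multiple heads, and the choice of a sigmoid over the channel mean as the spatial weight is an assumption made only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def spatial_weights(low_level):
    """One weight in (0, 1) per token.

    Assumption for illustration: sigmoid of the channel mean.
    """
    return [1.0 / (1.0 + math.exp(-sum(tok) / len(tok))) for tok in low_level]


def fuse(template, search):
    """Hypothetical fusion of template and search tokens."""
    # Self-attention inside each branch (stands in for the low-level features).
    t_self = attention(template, template, template)
    s_self = attention(search, search, search)
    # Mutual attention: search tokens query the template tokens.
    fused = attention(s_self, t_self, t_self)
    # Re-weight each fused search token by its low-level spatial weight.
    w = spatial_weights(s_self)
    return [[wi * x for x in tok] for wi, tok in zip(w, fused)]


template = [[1.0, 0.0], [0.0, 1.0]]
search = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused_tokens = fuse(template, search)
```

Here `fused_tokens` holds one re-weighted feature vector per search token. In an actual tracker these would be feature maps from a backbone, passed on to a prediction head; the toy vectors above only show the data flow.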