A Robust Attention-Enhanced Network with Transformer for Visual Tracking.

Fengwei Gu, Jun Lu, Chengtao Cai
DOI: https://doi.org/10.1007/s11042-023-15168-5
IF: 2.577
2023-01-01
Multimedia Tools and Applications
Abstract: Recently, Siamese-based trackers have become particularly popular. The correlation module in these trackers fuses feature information from the template and the search region to obtain the response results. However, video sequences contain rich contextual information and feature dependencies, and a simple correlation module struggles to integrate this information efficiently. As a result, the tracker suffers from information loss and locally optimal solutions. In this work, we propose a novel attention-enhanced network with a Transformer variant for robust visual tracking. The proposed method introduces a local feature information association (LFIA) module and a global feature information fusion (GFIF) module, both based on the attention mechanism, which effectively exploit contextual information and feature dependencies to enhance feature information. Our approach transforms the visual tracking problem into a bounding box prediction problem, using only a simple prediction network for object localization, without any prior knowledge. Ultimately, we propose a robust tracker called RANformer. Experiments show that the proposed tracker achieves state-of-the-art performance on 7 popular tracking benchmarks while meeting real-time requirements with a speed exceeding 40 FPS.
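To make the contrast in the abstract concrete, the following minimal NumPy sketch compares a plain cross-correlation response with a generic cross-attention fusion of search and template features. It is an illustration only, under stated assumptions: it is not the LFIA or GFIF module, whose internals the abstract does not describe, and the function names `correlation_response` and `cross_attention_fusion`, the tensor shapes, and the residual connection are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def correlation_response(template, search):
    """Naive cross-correlation (hypothetical helper).

    Slides a template feature map of shape (C, h, w) over a search
    feature map of shape (C, H, W) and returns a response map of shape
    (H - h + 1, W - w + 1). Each response value is a single local
    inner product, so no context beyond the window is used.
    """
    _, h, w = template.shape
    _, H, W = search.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            out[i, j] = np.sum(search[:, i:i + h, j:j + w] * template)
    return out


def cross_attention_fusion(search_tokens, template_tokens):
    """Generic scaled dot-product cross-attention (hypothetical helper).

    search_tokens:   (Ns, C) flattened search-region features (queries).
    template_tokens: (Nt, C) flattened template features (keys, values).
    Every search token attends to every template token, so the fused
    feature can draw on global dependencies. Learned projections are
    omitted here for brevity; a residual connection is assumed.
    """
    d = search_tokens.shape[-1]
    scores = search_tokens @ template_tokens.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)  # (Ns, Nt), each row sums to 1
    fused = search_tokens + weights @ template_tokens
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    template = rng.standard_normal((8, 4, 4))
    search = rng.standard_normal((8, 16, 16))

    response = correlation_response(template, search)
    print("correlation response shape:", response.shape)

    # Flatten the spatial dimensions into token sequences of shape (N, C).
    t_tokens = template.reshape(8, -1).T
    s_tokens = search.reshape(8, -1).T
    fused, weights = cross_attention_fusion(s_tokens, t_tokens)
    print("fused feature shape:", fused.shape)
```

The correlation output is a single-channel response map, whereas the attention output keeps the full feature dimension for every search location, which is what allows a downstream prediction network to regress a bounding box from it.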