Abstract: Transformers have successfully replaced the CNN-based cross-correlation operation in visual tracking, significantly improving tracking performance. The attention mechanism, the core module of the Transformer, overcomes the long-range dependency problem, which CNN-based methods cannot. However, tracking performance remains highly susceptible to interference from secondary background information, because the global attention mechanism cannot concentrate on the primary target information. In this work, we propose a residual self-attention mechanism suited to object tracking that concentrates on the primary target information. Furthermore, we design ResAT, an effective and concise tracker that adopts this residual attention mechanism together with a residual strategy specifically tailored for object tracking. ResAT outperforms all previous SOTA trackers on challenging large-scale benchmarks, including GOT-10k, LaSOT, TrackingNet, and UAV123, while running at 45 FPS.

ResAT: Visual Tracking with Residual Attention