VTT: Long-term Visual Tracking with Transformers

Tianling Bian, Yang Hua, Tao Song, Zhengui Xue, Ruhui Ma, Neil Robertson, Haibing Guan
DOI: https://doi.org/10.1109/icpr48806.2021.9412156
2020-01-01
Abstract: Long-term visual tracking is a challenging problem. State-of-the-art long-term trackers, e.g., GlobalTrack, use region proposal networks (RPNs) to generate target proposals. However, the performance of these trackers degrades under occlusion and large variations in scale or aspect ratio. To address these issues, we are the first to propose a transformer-based architecture for long-term visual tracking. Specifically, the proposed Visual Tracking Transformer (VTT) uses a transformer encoder-decoder architecture to aggregate global information, which helps it handle occlusion and large variations in scale or aspect ratio. It also shows better discriminative power against instance-level distractors without requiring extra labeling or hard-sample mining. We conduct extensive experiments on three large-scale long-term tracking datasets and achieve state-of-the-art performance.