Transformer Visual Tracker Based on Template Features Corresponding to Foreground Region

Jianglei Yu, Xin Ma
DOI: https://doi.org/10.1109/icip46576.2022.9897717
2022-01-01
Abstract: In visual tracking, the template patch in the image is usually several times larger than the object's bounding box, so background information around the object is encoded into some of the template features. These features are nevertheless matched against the search features, which interferes with the tracker's ability to accurately separate the object from the background. In this work, we present a novel Transformer-based feature fusion network for visual tracking. Specifically, to reduce the interference of background information in the template patch, we extract the template features corresponding to the foreground region of the image, called TFFR, and fuse them with the search features through an attention mechanism. On this basis, we design a concise Transformer visual tracker based on TFFR, called TVT-TFFR. Extensive experiments show that TVT-TFFR achieves state-of-the-art performance on several prevalent tracking benchmarks and runs at 38 FPS, meeting the real-time requirement.
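The abstract describes TFFR only at a high level. The sketch below illustrates one plausible reading of the idea and is not the authors' TVT-TFFR implementation. It assumes a square template patch mapped to a square grid of feature tokens, keeps only the tokens whose cell centre falls inside the object box, and lets every search token attend to those foreground tokens. The function names (`foreground_token_mask`, `cross_attention`, `fuse_with_foreground_template`), the centre-in-box selection rule, the residual addition, and the single-head attention without learned projections are all hypothetical simplifications.

```python
import numpy as np


def foreground_token_mask(grid_size, patch_size, box):
    """Boolean mask (grid_size * grid_size,) that is True for template tokens whose
    cell centre falls inside box = (x0, y0, x1, y1), in template-patch pixels.
    Assumes a square patch and a square token grid (hypothetical setup)."""
    stride = patch_size / grid_size
    centres = (np.arange(grid_size) + 0.5) * stride  # pixel centre of each grid cell
    cx, cy = np.meshgrid(centres, centres)  # cx[row, col] is the x-centre, cy the y-centre
    x0, y0, x1, y1 = box
    inside = (cx >= x0) & (cx <= x1) & (cy >= y0) & (cy <= y1)
    return inside.reshape(-1)  # row-major flattening matches token order


def cross_attention(query, key_value):
    """Single-head scaled dot-product attention without learned projections:
    each query token attends over the key_value tokens (a simplification)."""
    d = query.shape[-1]
    scores = query @ key_value.T / np.sqrt(d)  # (Nq, Nkv)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability before exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over template tokens
    return weights @ key_value, weights


def fuse_with_foreground_template(search_feats, template_feats, grid_size, patch_size, box):
    """Keep only foreground template tokens, let search tokens attend to them,
    and add the result back residually (hypothetical fusion step)."""
    mask = foreground_token_mask(grid_size, patch_size, box)
    fg_template = template_feats[mask]  # (N_fg, C): background tokens are dropped
    attended, weights = cross_attention(search_feats, fg_template)
    return search_feats + attended, mask, weights


rng = np.random.default_rng(0)
C = 32
template_feats = rng.standard_normal((8 * 8, C))   # assumed 128-px template, stride 16
search_feats = rng.standard_normal((16 * 16, C))   # assumed 256-px search region, stride 16
fused, mask, weights = fuse_with_foreground_template(
    search_feats, template_feats, grid_size=8, patch_size=128, box=(32, 32, 96, 96)
)
print(mask.sum(), fused.shape)  # 16 (256, 32)
```

Because background tokens are removed before the softmax, each search token's attention weights are spread only over the 16 foreground template tokens rather than all 64. This reflects the motivation the abstract gives for reducing background interference. A practical tracker would typically add learned query/key/value projections, multiple heads, and normalization layers, none of which this sketch attempts to reproduce.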