Transformer Tracking via Frequency Fusion

Xiantao Hu, Bineng Zhong, Qihua Liang, Shengping Zhang, Ning Li, Xianxian Li, Rongrong Ji
DOI: https://doi.org/10.1109/tcsvt.2023.3289624
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Transformers have achieved impressive progress in visual tracking owing to their capability for global modeling, which enables them to learn low-frequency features (i.e., high-level semantic information). However, they seem to overlook high-frequency features (i.e., low-level texture and edge information), which are crucial for distinguishing different intra-class object instances in the tracking task. To address this issue, we propose a transformer-based tracker from a frequency fusion perspective and investigate whether high-frequency and low-frequency features can be effectively combined to achieve robust tracking. Specifically, we design a simple yet effective two-stage fusion strategy and apply an appropriate frequency fusion scheme at each stage of the tracking process so as to make full use of frequency-domain information. In the feature extraction stage, we use the high-frequency subbands of a wavelet decomposition to address the performance loss caused by the transformer's catastrophic forgetting of high-frequency information. In the prediction head stage, we use a variety of wavelet decomposition subbands to model multi-frequency information. The two-stage fusion strategy lets our model extract more balanced and beneficial multi-frequency information, enabling it to effectively capture target texture and local edge information while remaining sensitive to global information. Extensive experiments on six challenging benchmarks (i.e., LaSOText, UAV123, TNL2K, LaSOT, TrackingNet, and GOT-10k) demonstrate the superior performance of our tracker.
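As a rough illustration of the wavelet subbands the abstract refers to, the sketch below performs a single-level 2D Haar decomposition with NumPy. It is not the paper's implementation: the abstract does not name the wavelet family, so Haar is an assumption made here for simplicity, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical. The low-frequency subband is a smoothed, downsampled copy of the input, while the three detail subbands carry the edge and texture content.

```python
import numpy as np


def haar_dwt2(image):
    """Single-level 2D orthonormal Haar decomposition (illustrative only).

    Returns (low, row_detail, col_detail, diag_detail), each half the
    input size along both axes. Haar is an assumption; the abstract does
    not specify which wavelet the tracker uses.
    """
    x = np.asarray(image, dtype=np.float64)
    if x.ndim != 2 or x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("expected a 2D array with even height and width")

    # The four pixels of every non-overlapping 2x2 block.
    a = x[0::2, 0::2]  # top-left
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right

    low = (a + b + c + d) / 2.0          # low-frequency approximation
    row_detail = (a + b - c - d) / 2.0   # change between rows (horizontal edges)
    col_detail = (a - b + c - d) / 2.0   # change between columns (vertical edges)
    diag_detail = (a - b - c + d) / 2.0  # diagonal detail
    return low, row_detail, col_detail, diag_detail


def haar_idwt2(low, row_detail, col_detail, diag_detail):
    """Invert haar_dwt2, recovering the original array exactly."""
    h, w = low.shape
    x = np.empty((2 * h, 2 * w), dtype=np.float64)
    x[0::2, 0::2] = (low + row_detail + col_detail + diag_detail) / 2.0
    x[0::2, 1::2] = (low + row_detail - col_detail - diag_detail) / 2.0
    x[1::2, 0::2] = (low - row_detail + col_detail - diag_detail) / 2.0
    x[1::2, 1::2] = (low - row_detail - col_detail + diag_detail) / 2.0
    return x


if __name__ == "__main__":
    # A vertical step edge: left half dark, right half bright.
    img = np.zeros((4, 4))
    img[:, 1::2] = 1.0
    low, row_detail, col_detail, diag_detail = haar_dwt2(img)
    print("low-frequency subband:\n", low)
    print("column-detail subband:\n", col_detail)
```

Because the transform is orthonormal and invertible, splitting a feature map into these subbands loses no information; a model can weight or fuse the low-frequency and detail subbands separately and still reconstruct the input. Naming conventions for the detail subbands (LH, HL, HH) vary between libraries, which is why descriptive names are used here.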
engineering, electrical & electronic