Full Attention Tracker: A Good Combination of Pixel-Level and Region-Level Cross-Correlation

Yuxuan Wang, Liping Yan, Zihang Feng, Yuanqing Xia, Bo Xiao
DOI: https://doi.org/10.23919/CCC58697.2023.10240179
2023-01-01
Abstract: Trackers based on Siamese neural networks are currently among the most accurate methods in visual tracking. Since transformers were introduced to the field, attention mechanisms have become increasingly common in tracking tasks. However, owing to the nature of the attention operation, transformers usually converge slowly, and their pixel-level correlation is prone to overfitting, which hinders long-term tracking. This paper presents FAT, a new framework that improves on MixFormer. FAT retains MixFormer's simultaneous feature extraction and target information integration, and introduces a Mixing block to suppress the background as much as possible before information interaction. In addition, a new operation is designed: the result of region-level cross-correlation guides the learning of pixel-level cross-correlation in attention, which accelerates model convergence and improves generalization. Finally, a joint loss function is designed to further improve the accuracy of the model. Experiments show that the presented tracker achieves excellent performance on five benchmark datasets.
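The abstract contrasts two kinds of cross-correlation, and a small NumPy sketch may make the distinction concrete. Pixel-level correlation is taken here to be the scaled dot-product score between every search pixel and every template pixel, as in attention. Region-level correlation is taken to be a Siamese-style sliding of the whole template over the search feature map. The abstract does not specify how the guidance is applied, so `guided_attention` below is purely a hypothetical illustration in which the region-level response reweights the attention rows. It is not FAT's actual operation, and all function names are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pixel_level_correlation(template, search):
    """Scaled dot product between every search pixel and every template pixel.

    template: (C, Ht, Wt), search: (C, Hs, Ws) -> scores of shape (Hs*Ws, Ht*Wt).
    """
    c = template.shape[0]
    t = template.reshape(c, -1)  # (C, Nt)
    s = search.reshape(c, -1)    # (C, Ns)
    return (s.T @ t) / np.sqrt(c)


def region_level_correlation(template, search):
    """Slide the whole template over the zero-padded search map.

    Returns a response map of shape (Hs, Ws); each value compares the
    template with one template-sized window of the search features.
    """
    _, ht, wt = template.shape
    _, hs, ws = search.shape
    ph, pw = ht // 2, wt // 2
    padded = np.pad(search, ((0, 0), (ph, ht - 1 - ph), (pw, wt - 1 - pw)))
    response = np.empty((hs, ws))
    for i in range(hs):
        for j in range(ws):
            response[i, j] = np.sum(padded[:, i:i + ht, j:j + wt] * template)
    return response


def guided_attention(template, search):
    """Hypothetical guidance (an assumption, not the paper's operation).

    Each search pixel's attention row is scaled by a gate derived from the
    region-level response at that pixel; the gate has mean 1 over all pixels.
    """
    attn = softmax(pixel_level_correlation(template, search), axis=-1)
    region = region_level_correlation(template, search).reshape(-1)
    gate = softmax(region, axis=0) * region.size
    return attn * gate[:, None]
```

Under these assumptions, search pixels that lie where the whole template matches well keep strong attention rows, while pixels in background regions are damped even if a few of their individual pixel scores happen to be high.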