Leveraging Local and Global Cues for Visual Tracking Via Parallel Interaction Network

Yaozong Zheng, Bineng Zhong, Qihua Liang, Zhenjun Tang, Rongrong Ji, Xianxian Li
DOI: https://doi.org/10.1109/tcsvt.2022.3212987
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Although both local and global context information are crucial for robust tracking, existing CNN-based and transformer-based methods mainly focus on only one of the two. Consequently, the former fail to exploit rich global context information because of their limited receptive field, while the latter are weak at modeling local relationships among neighboring regions. To address this issue, we propose the SiamPIN tracker, based on our Parallel Interaction Network. It consists of two effective modules, namely the Global Aggregation Block (GAB) and the Local Process Block (LPB). GAB perceives the global context to capture long-range spatial dependencies through a transformer-based architecture. Meanwhile, LPB performs local information extraction using a CNN model to retain the detailed appearance information of the target. These two modules are connected consecutively to compose a Trans-Conv unit block, which transmits the global context information to the local feature extraction procedure, thereby enabling the interaction of global and local information flows. Several such blocks are cascaded so that our model can learn to aggregate local and context information interactively. The proposed tracker achieves state-of-the-art performance on six benchmark datasets while maintaining a real-time running speed.
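The data flow described in the abstract (a global step followed by a local step, with several such units cascaded) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the parameter-free attention, the fixed 3-tap smoothing kernel, and the 1-D token layout are all assumptions chosen to keep the example dependency-free. The actual GAB and LPB are learned transformer and CNN modules.

```python
import math

def global_aggregation(feats):
    """Stand-in for GAB (assumption): parameter-free self-attention,
    so every position mixes in context from all other positions."""
    dim = len(feats[0])
    out = []
    for q in feats:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in feats]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        ctx = [sum(w * v[d] for w, v in zip(weights, feats)) / total
               for d in range(dim)]
        out.append([x + c for x, c in zip(q, ctx)])  # residual connection
    return out

def local_process(feats, kernel=(0.25, 0.5, 0.25)):
    """Stand-in for LPB (assumption): a fixed 3-tap convolution over
    neighbouring positions, with edge replication at the borders."""
    n, dim = len(feats), len(feats[0])
    out = []
    for i in range(n):
        left, mid, right = feats[max(i - 1, 0)], feats[i], feats[min(i + 1, n - 1)]
        out.append([kernel[0] * left[d] + kernel[1] * mid[d] + kernel[2] * right[d]
                    for d in range(dim)])
    return out

def trans_conv_block(feats):
    """One hypothetical Trans-Conv unit: global context first, then local
    processing, so the local step sees globally enriched features."""
    return local_process(global_aggregation(feats))

def cascade(feats, num_blocks=3):
    """Stack several units, as the abstract describes; num_blocks is arbitrary."""
    for _ in range(num_blocks):
        feats = trans_conv_block(feats)
    return feats

if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    for row in cascade(tokens, num_blocks=2):
        print([round(x, 3) for x in row])
```

The ordering inside `trans_conv_block` is the point of the sketch: because the local step runs on the output of the global step, each unit passes global context into local feature extraction, and stacking units repeats that exchange.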