Bidirectional Interaction of CNN and Transformer Feature for Visual Tracking

Baozhen Sun, Zhenhua Wang, Shilei Wang, Yongkang Cheng, Jifeng Ning
DOI: https://doi.org/10.1109/tcsvt.2024.3376690
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Empowered by the sophisticated long-range dependency modeling ability of the Transformer, tracking performance has risen markedly in recent years. Approaches in this vein leverage the Transformer feature to integrate the information of target and search regions while neglecting the superior local representation extracted by their CNN backbone. To address this, we introduce a BIdirectional inTeraction mechanism between CNN and Transformer features for visual tracking, termed BIT-Tracker, which admits a comprehensive fusion of local and global representations and thus boosts tracking performance. The first ingredient of BIT-Tracker is an aggregation of multi-level Transformer features to achieve a better global modeling ability. To combine the merits of both local and global representations, the second ingredient performs a bidirectional interaction between CNN and Transformer features, where the interaction is achieved by either querying the CNN feature from the Transformer feature or querying the Transformer feature from the CNN feature. Afterwards, the outputs from both directions are fused to predict the temporal locations of targets. Extensive experiments demonstrate the effectiveness of the proposed feature aggregation and bidirectional interaction modules. Impressively, BIT-Tracker achieves leading performance on eight tracking benchmarks and outperforms state-of-the-art (SOTA) results by salient margins. Code will be made available.
engineering, electrical & electronic
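The bidirectional interaction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product cross-attention without learned projections, residual connections, equal token counts for both feature maps, and fusion by channel-wise concatenation. None of these details are stated in the abstract, and the names cross_attention and bidirectional_interaction are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    # Hypothetical helper: each query token attends over all context tokens.
    # query: (N_q, D), context: (N_c, D) -> output: (N_q, D)
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ context

def bidirectional_interaction(cnn_tokens, trans_tokens):
    # Direction 1: Transformer tokens query the CNN feature (assumed residual).
    t_from_c = trans_tokens + cross_attention(trans_tokens, cnn_tokens)
    # Direction 2: CNN tokens query the Transformer feature (assumed residual).
    c_from_t = cnn_tokens + cross_attention(cnn_tokens, trans_tokens)
    # Assumed fusion: concatenate both directions along the channel axis.
    return np.concatenate([t_from_c, c_from_t], axis=-1)

# Toy inputs: 16 tokens with 8 channels each (arbitrary sizes for illustration).
rng = np.random.default_rng(0)
cnn_tokens = rng.standard_normal((16, 8))
trans_tokens = rng.standard_normal((16, 8))
fused = bidirectional_interaction(cnn_tokens, trans_tokens)
print(fused.shape)  # (16, 16)
```

In a real tracker the queries, keys, and values would pass through learned linear projections and multiple heads, and the fused tokens would feed a prediction head; the sketch only shows the two query directions and one possible fusion.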