Exploring the Complementarity Between Convolution and Transformer Matching for Visual Tracking

Zheng'ao Wang, Ming Li, Wenjie Pei, Guangming Lu, Fanglin Chen
DOI: https://doi.org/10.1016/j.knosys.2024.112184
IF: 8.139
2024-01-01
Knowledge-Based Systems
Abstract: The essence of Siamese trackers is similarity matching between the deep feature of a target template and the deep feature of a search region. With the successful application of the Transformer in the vision community, similarity matching is moving from convolution matching to Transformer matching. While this transition brings a performance boost, we find that convolution matching and Transformer matching are intuitively complementary. Employing only one of the two is therefore suboptimal for trackers, and exploiting their complementarity holds great potential. To this end, we present a Matching Knowledge Fusion (MKF) module that efficiently integrates a convolution matching and an enhanced Transformer matching to exploit this complementarity. Furthermore, because the noisy and ambiguous attention weights of Transformer matching degrade the matching results, we propose a novel mechanism that uses complementary matching knowledge to correct the attention weights. Based on the Matching Knowledge Fusion module, we build a simple but effective tracker, dubbed MKFTrack. Extensive experiments demonstrate the favorable performance of our tracker against state-of-the-art ones.
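To make the two matching styles named in the abstract concrete, the following is a minimal, generic sketch in plain Python. It is not the MKF module and does not reproduce the paper's fusion or attention-correction mechanism, neither of which is specified in the abstract. The function names (`conv_matching`, `attention_matching`), the 1-D token layout, and the toy feature values are all assumptions made for illustration. Convolution matching is shown as a sliding cross-correlation with the template as the kernel; Transformer matching is shown as single-head cross-attention with search tokens as queries and template tokens as keys and values.

```python
import math


def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def conv_matching(search, template):
    """Cross-correlation: slide the template over the search tokens.

    Returns one similarity response per valid offset (illustrative only).
    """
    n, m = len(search), len(template)
    return [
        sum(dot(search[i + j], template[j]) for j in range(m))
        for i in range(n - m + 1)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_matching(search, template):
    """Single-head cross-attention: search tokens attend to template tokens.

    Returns the attended features and the attention weights (illustrative only).
    """
    channels = len(template[0])
    scale = math.sqrt(channels)
    outputs, weights = [], []
    for query in search:
        w = softmax([dot(query, key) / scale for key in template])
        weights.append(w)
        outputs.append(
            [sum(wi * value[c] for wi, value in zip(w, template))
             for c in range(channels)]
        )
    return outputs, weights


# Toy features (assumed values): 3 search tokens, 2 template tokens, 2 channels.
search = [[1, 0], [0, 1], [1, 1]]
template = [[0, 1], [1, 1]]

responses = conv_matching(search, template)
attended, attn_weights = attention_matching(search, template)
print(responses)
print(attn_weights)
```

In this sketch, convolution matching compares the template as a whole at each spatial offset, while attention matching lets every search token weigh every template token individually. The attention weights returned here are the kind of quantity the abstract describes as potentially noisy and ambiguous.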