Feature-comparison Network for Visual Tracking

Cui Zhiyan, Lu Na
DOI: https://doi.org/10.1007/s10489-023-04466-y
IF: 5.3
2023-01-01
Applied Intelligence
Abstract: Siamese networks have been widely applied to visual tracking because of their good performance. However, that performance depends on the selection of several hyperparameters, including the cosine window weight and the target scale penalty. Inappropriate parameter selection leads to biased target localization and unsteady tracking, and the selection itself is dataset-specific and time-consuming. These parameters are needed because the target response map is diffused and subject to background interference. In addition, Siamese networks compare the target template with candidates using a simple inner product, which is linear, unbounded, and subject to covariate shift, and which does not benefit the learning of target-background discriminant features. To address these issues, a novel feature-comparison network (FCNet) is developed, which combines a feature extraction network and a feature comparison network. First, an RoIAlign layer is incorporated for efficient target proposal generation. Then, the Siamese structure is borrowed to form the feature extraction network, but with a different network architecture. Instead of the simple inner product used in Siamese networks, a feature concatenation and comparison structure is adopted for evaluating sample feature similarity; it combines several convolutional and fully-connected layers for similarity computation. The comparison network, which is nonlinear, bounded, and free of covariate shift, performs more efficient correlation computation and provides similarity feedback for learning target-background discriminant features with stronger representation and generalization. FCNet produces a more compact and target-dominant response map, which ensures robust and steady tracking. Experiments on the OTB2013, OTB2015, VOT2016, and UAV123 benchmarks show that FCNet achieves state-of-the-art real-time tracking performance at 30 FPS. The code and models will be available on GitHub.
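The contrast the abstract draws between inner-product matching and a learned comparison of concatenated features can be illustrated with a minimal sketch. It is not the authors' FCNet: the function names, the two fully-connected layers, the random untrained weights, and the sigmoid output (used here as one way to obtain a bounded score) are all assumptions made for illustration, and the convolutional layers mentioned in the abstract are omitted.

```python
import math
import random


def inner_product_score(template, candidate):
    """Siamese-style similarity: a plain inner product (linear, unbounded)."""
    return sum(t * c for t, c in zip(template, candidate))


def make_comparison_head(dim, hidden, seed=0):
    """Build a hypothetical two-layer comparison head with random weights.

    A trained network would learn these weights; random values are used
    only so the sketch runs on its own.
    """
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(2 * dim)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.gauss(0.0, 0.1) for _ in range(hidden)]
    b2 = 0.0
    return w1, b1, w2, b2


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def comparison_score(template, candidate, head):
    """Concatenate the two feature vectors and score them nonlinearly.

    The sigmoid keeps the score within [0, 1] (an assumed way to get a
    bounded output).
    """
    w1, b1, w2, b2 = head
    x = list(template) + list(candidate)  # feature concatenation
    h = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)  # ReLU layer
         for row, b in zip(w1, b1)]
    logit = sum(w * v for w, v in zip(w2, h)) + b2
    return _sigmoid(logit)
```

Scaling both feature vectors by 10 multiplies the inner-product score by 100, whereas the comparison score stays within [0, 1] regardless of feature magnitude. This is the sense of "unbounded" versus "bounded" that the sketch is meant to convey.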