Abstract: Training a deep-learning-based tracker on large datasets usually requires high computational power and significant time, so training is not always an option. In this paper, we propose a framework built on two ideas for Siamese-based trackers: (i) extending the number of templates in a way that removes the need to retrain the network, and (ii) a lightweight temporal network with a novel architecture that focuses on both local and global information and can be used independently of the tracker. Most Siamese-based trackers rely only on the first frame as the ground truth for the object and struggle when the target's appearance changes significantly in subsequent frames in the presence of similar distractors. Some trackers use multiple templates, but they mostly rely on constant thresholds to update them, or they replace templates that have low similarity scores only with more similar ones. Unlike previous work, we use adaptive thresholds that update the bag with similar templates as well as with templates that are slightly diverse. Adaptive thresholds also yield an overall improvement over constant ones. In addition, mixing the feature maps obtained from each template in the last stage of the network removes the need to retrain the tracker. Our proposed lightweight temporal network, CombiNet, learns the path history of different objects using only object coordinates and predicts the target's potential location in the next frame. It is tracker-independent, and applying it to a new tracker requires no further training. With these ideas implemented, tracker performance improved on all datasets tested, including LaSOT, LaSOT extension, TrackingNet, OTB100, OTB50, UAV123 and UAV20L. Experiments indicate that the proposed framework works well with both convolutional and transformer-based trackers. The official Python code for this paper will be made publicly available upon publication.
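
The multi-template idea can be illustrated with a minimal sketch. It is not the paper's implementation: the class `TemplateBag`, the function `fuse_responses`, the band width `k`, and the rule "recent mean minus `k` standard deviations" are all assumptions chosen for illustration, since the abstract does not specify how the adaptive threshold is computed. The sketch only shows the general shape of the idea: a threshold that follows recent similarity scores admits templates that are similar or slightly diverse, and the per-template outputs are combined at the last stage so that the network weights stay untouched. Plain nested lists stand in for the feature maps.

```python
from collections import deque
from statistics import mean, pstdev


class TemplateBag:
    """Hypothetical bag of templates updated with an adaptive threshold."""

    def __init__(self, first_template, capacity=5, history=20, k=1.0):
        self.first = first_template              # first-frame template is never evicted
        self.extra = deque(maxlen=capacity - 1)  # newer templates; oldest is dropped first
        self.scores = deque(maxlen=history)      # recent similarity scores
        self.k = k                               # width of the acceptance band (assumed)

    def threshold(self):
        # Assumed rule: recent mean minus k standard deviations.
        # Scores slightly below the mean still pass, which admits
        # templates that are a little diverse.
        if len(self.scores) < 2:
            return float("inf")  # not enough history yet: accept nothing
        return mean(self.scores) - self.k * pstdev(self.scores)

    def update(self, template, score):
        # Decide with the threshold from past scores, then record the new score.
        accepted = score >= self.threshold()
        if accepted:
            self.extra.append(template)
        self.scores.append(score)
        return accepted

    def templates(self):
        return [self.first, *self.extra]


def fuse_responses(response_maps):
    """Average per-template 2-D maps element-wise (a simple late fusion)."""
    n = len(response_maps)
    rows = len(response_maps[0])
    cols = len(response_maps[0][0])
    return [[sum(m[r][c] for m in response_maps) / n for c in range(cols)]
            for r in range(rows)]


bag = TemplateBag("frame0", capacity=3)
for name, score in [("a", 0.9), ("b", 0.8), ("c", 0.85), ("d", 0.3)]:
    bag.update(name, score)
print(bag.templates())
```

Averaging is only one possible way to mix the maps; a weighted mean or an element-wise maximum would fit the same interface.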

Learning Deep Lucas-Kanade Siamese Network for Visual Tracking

Learning Temporal-Correlated and Channel-Decorrelated Siamese Networks for Visual Tracking

DASTSiam: Spatio‐temporal Fusion and Discriminative Enhancement for Siamese Visual Tracking

Discriminative and Robust Online Learning for Siamese Visual Tracking

Learning Localization-aware Target Confidence for Siamese Visual Tracking

Learning Motion-Perceive Siamese network for robust visual object tracking

Siamese Residual Network for Efficient Visual Tracking

R-SiamNet: ROI-Align Pooling Baesd Siamese Network for Object Tracking

Improving Siamese Based Trackers with Light or No Training through Multiple Templates and Temporal Network

Deformable Siamese Attention Networks for Visual Object Tracking

Siamese Centerness Prediction Network for Real-Time Visual Object Tracking

Learning to Match Using Siamese Network for Object Tracking

Mutual Learning and Feature Fusion Siamese Networks for Visual Object Tracking

NCSiam: Reliable Matching Via Neighborhood Consensus for Siamese-Based Object Tracking

Visual Tracking With Siamese Network Based on Fast Attention Network

Antidecay LSTM for Siamese Tracking With Adversarial Learning

Siamese Tracking Network with Spatial-Semantic-Aware Attention and Flexible Spatiotemporal Constraint

Siamese-Based Attention Learning Networks for Robust Visual Object Tracking

SiamST: Siamese Network with Spatio-Temporal Awareness for Object Tracking

The Multi-task Fully Convolutional Siamese Network with Correlation Filter Layer for Real-Time Visual Tracking

Residual Attention SiameseRPN for Visual Tracking