Abstract: Recent Siamese trackers have increasingly adopted Transformer-based structures, which markedly improve accuracy through extensive information fusion and cross-attention enhancement. However, further gains in the performance of traditional Siamese trackers are held back by their low robustness to interference from similar distractors. Representation ability and discrimination against similar distractors are inherently at odds. Although the representation ability of recent Transformer-based trackers has been broadly enhanced, these trackers still respond strongly to similar distractors because of the similarity-matching mechanism of the Siamese structure. To tackle these problems, we propose a Tracking-in-Tracking (TiT) pipeline, an outer tracker with an inner tracker, consisting of an antecedent tracking stage and a refining tracking stage. Instead of capturing only the single candidate that best matches the template, perceiving all potential candidates provides useful information about possible similar distractors. Based on this insight, a Transformer-based outer tracker is constructed to recognize all candidates in the antecedent tracking stage. Subsequently, in the refining tracking stage, an inner tracker identifies the target accurately among the selected candidates using a designed bilateral feedback mechanism (BFM) and a peak distilling module (PDM). In this way, the Transformer-based outer tracker and the motion-estimated inner tracker supervise each other to achieve robust tracking without further increasing model complexity or memory burden. Extensive experiments demonstrate that TiT can serve as a unified framework for discriminating similar interference and achieves state-of-the-art (SOTA) performance on mainstream benchmarks.

Motion-Driven Tracking via End-to-End Coarse-to-Fine Verifying

Track Without Appearance: Learn Box and Tracklet Embedding with Local and Global Motion Patterns for Vehicle Tracking

Hierarchical Tracking by Reinforcement Learning-Based Searching and Coarse-to-Fine Verifying

Multi-features Guided Robust Visual Tracking.

Exploring Reliable Visual Tracking Via Target Embedding Network

Enhancing Discriminative Appearance Model for Visual Tracking.

Beyond Local Search: Tracking Objects Everywhere with Instance-Specific Proposals

Adaptive Part Mining for Robust Visual Tracking.

MotionTrack: Learning Robust Short-term and Long-term Motions for Multi-Object Tracking

Robust visual tracking based on generative and discriminative model collaboration

Improve Visual Tracking by End-to-end Multi-Tracker Selection.

Robust Visual Object Tracking Based on Feature Channel Weighting and Game Theory

Modeling of Multiple Spatial-Temporal Relations for Robust Visual Object Tracking

Robust Visual Tracking with Deep Feature Fusion

Tracking in tracking: An efficient method to solve the tracking distortion

Visual tracking with screening region enrichment and target validation

MotionTrack: Learning Motion Predictor for Multiple Object Tracking

MIMTracking: Masked image modeling enhanced vision transformer for visual object tracking

Robust Visual Tracking Method via Deep Learning

Robust Tracking Via Patch-Based Appearance Model and Local Background Estimation

Robust Visual Tracking Based on Hierarchical Appearance Model.