Abstract: We introduce CoTracker, a transformer-based model that tracks a large number of 2D points in long video sequences. Unlike most existing approaches, which track points independently, CoTracker tracks them jointly, accounting for their dependencies. We show that joint tracking significantly improves tracking accuracy and robustness, and allows CoTracker to track occluded points and points outside of the camera view. We also introduce several innovations for this class of trackers, including token proxies that significantly improve memory efficiency and allow CoTracker to track 70k points jointly and simultaneously at inference on a single GPU. CoTracker is an online algorithm that operates causally on short windows. However, it is trained by unrolling windows as a recurrent network, so it maintains tracks for long periods of time even when points are occluded or leave the field of view. Quantitatively, CoTracker substantially outperforms prior trackers on standard point-tracking benchmarks.

What problem does this paper attempt to address?

This paper addresses the problem of accurately tracking a large number of 2D points through long video sequences, particularly when points are occluded or leave the camera's field of view. Most existing point-tracking methods track each point independently. CoTracker instead tracks many points jointly and exploits the dependencies between them to improve accuracy and robustness. It does so through the following innovations:

1. **Joint Tracking**: Rather than tracking each point independently, CoTracker models the dependencies between points. This significantly improves accuracy and robustness, especially when points are occluded or leave the camera's field of view.
2. **Support Points**: Additional support points are tracked alongside the query points to give the tracker more context, similar to the use of context in visual object tracking.
3. **Proxy Tokens**: To reduce memory complexity, CoTracker introduces proxy tokens, which are processed like a small number of additional trajectories. They replace the expensive self-attention across all tracks with efficient cross-attention, which lets CoTracker jointly track near-dense trajectories on a single GPU.
4. **Recurrent Training Strategy**: CoTracker operates online in a sliding-window manner. The network is applied recurrently across windows and trained with those windows unrolled, so it maintains long-term tracks even when points are occluded or out of view for a long time.

Together, these innovations let CoTracker substantially outperform previous trackers on standard point-tracking benchmarks, especially in long-term tracking and occlusion handling.
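The proxy-token idea in item 3 can be illustrated with a minimal NumPy sketch. It is not the CoTracker implementation: the three-step arrangement (proxies read from tracks, proxies attend to each other, tracks read back from proxies), the function names `attend` and `proxy_attention`, the absence of learned projections, and the sizes `N`, `K`, `D` are all assumptions chosen for illustration. The point is only that attention scores scale with `N * K` rather than `N * N` when `K` proxies mediate the exchange between `N` tracks.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys_values):
    """Single-head scaled dot-product attention with no learned projections
    (an illustrative simplification)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # (num_queries, num_keys)
    return softmax(scores, axis=-1) @ keys_values  # (num_queries, d)


def proxy_attention(track_tokens, proxy_tokens):
    """Exchange information between N track tokens through K proxy tokens.

    Hypothetical arrangement, assumed for this sketch:
      1. proxies gather information from the tracks   -> K x N scores
      2. proxies exchange information with each other -> K x K scores
      3. tracks read the result back from the proxies -> N x K scores
    """
    proxies = proxy_tokens + attend(proxy_tokens, track_tokens)
    proxies = proxies + attend(proxies, proxies)
    return track_tokens + attend(track_tokens, proxies)


rng = np.random.default_rng(0)
N, K, D = 2000, 16, 32  # tracks, proxies, feature size (illustrative values)
tracks = rng.standard_normal((N, D))
proxy_init = rng.standard_normal((K, D))

out = proxy_attention(tracks, proxy_init)

full_scores = N * N               # scores needed by self-attention over all tracks
proxy_scores = 2 * N * K + K * K  # scores needed by the proxy route above
print(out.shape, full_scores, proxy_scores)
```

With a fixed number of proxies the score count grows linearly in the number of tracks, which is the kind of saving that makes near-dense joint tracking feasible on a single GPU.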

CoTracker: It is Better to Track Together

CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos

High-speed Tracking with Multi-Templates Correlation Filters

APPTracker Plus: Displacement Uncertainty for Occlusion Handling in Low-Frame-Rate Multiple Object Tracking

APPTracker: Improving Tracking Multiple Objects in Low-Frame-Rate Videos

Chained-Tracker: Chaining Paired Attentive Regression Results for End-to-End Joint Multiple-Object Detection and Tracking

InterTrack: Interaction Transformer for 3D Multi-Object Tracking

Multi-Timescale Collaborative Tracking

ODTrack: Online Dense Temporal Token Learning for Visual Tracking

Tracking Objects as Points

Symbiotic Tracker Ensemble Toward A Unified Tracking Framework

CVTrack: Combined Convolutional Neural Network and Vision Transformer Fusion Model for Visual Tracking

Modeling of Multiple Spatial-Temporal Relations for Robust Visual Object Tracking

STTracker: Spatio-Temporal Tracker for 3D Single Object Tracking

Local All-Pair Correspondence for Point Tracking

EasyTrack: Efficient and Compact One-stream 3D Point Clouds Tracker

Joint Feature Correspondences and Appearance Similarity for Robust Visual Object Tracking

VideoTrack: Learning to Track Objects Via Video Transformer

Robust Visual Tracking Using Multi-Frame Multi-Feature Joint Modeling

SpatialTracker: Tracking Any 2D Pixels in 3D Space