CoTracker: It is Better to Track Together

Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, Christian Rupprecht
2024-10-01
Abstract: We introduce CoTracker, a transformer-based model that tracks a large number of 2D points in long video sequences. Differently from most existing approaches that track points independently, CoTracker tracks them jointly, accounting for their dependencies. We show that joint tracking significantly improves tracking accuracy and robustness, and allows CoTracker to track occluded points and points outside of the camera view. We also introduce several innovations for this class of trackers, including using token proxies that significantly improve memory efficiency and allow CoTracker to track 70k points jointly and simultaneously at inference on a single GPU. CoTracker is an online algorithm that operates causally on short windows. However, it is trained utilizing unrolled windows as a recurrent network, maintaining tracks for long periods of time even when points are occluded or leave the field of view. Quantitatively, CoTracker substantially outperforms prior trackers on standard point-tracking benchmarks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of accurately tracking a large number of 2D points in long video sequences, especially when points are occluded or leave the camera's field of view. Most existing point-tracking methods track each point independently. CoTracker instead tracks many points jointly and uses the dependencies between them to improve tracking accuracy and robustness. Specifically, CoTracker relies on the following innovations:

1. **Joint Tracking**: Rather than tracking each point in isolation, CoTracker accounts for the dependencies between points, which significantly improves accuracy and robustness, especially when points are occluded or leave the camera's field of view.
2. **Support Points**: Additional support points are introduced to give the tracker more context, similar to the use of context in visual object tracking.
3. **Proxy Tokens**: To reduce the model's memory complexity, CoTracker introduces proxy tokens, which are handled like a small number of additional trajectories. They replace the expensive self-attention with a more efficient cross-attention, enabling CoTracker to track near-dense trajectories jointly on a single GPU.
4. **Recurrent Training Strategy**: CoTracker operates online in a sliding-window manner. The network is applied recurrently across windows and trained on unrolled windows, so it maintains tracks over long periods, even when points are occluded or leave the field of view for a long time.

Through these innovations, CoTracker significantly outperforms previous trackers on standard point-tracking benchmarks, especially in long-term tracking and occlusion handling.
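To make the memory argument behind proxy tokens concrete, the following back-of-the-envelope sketch counts attention-score entries for one attention step across tracks. It is not the paper's implementation. The three-stage layout (proxies read from tracks, proxies attend to each other, tracks read back from proxies), the function names, and the choice of 64 proxies are assumptions made for illustration. Only the 70k track count comes from the abstract.

```python
def full_attention_pairs(num_tracks: int) -> int:
    """Score entries when every track attends directly to every track."""
    return num_tracks * num_tracks


def proxy_attention_pairs(num_tracks: int, num_proxies: int) -> int:
    """Score entries when tracks exchange information only via proxy tokens.

    Assumed layout (hypothetical, for illustration):
      1. proxies cross-attend to tracks  -> num_proxies * num_tracks
      2. proxies self-attend             -> num_proxies * num_proxies
      3. tracks cross-attend to proxies  -> num_tracks * num_proxies
    """
    read = num_proxies * num_tracks
    mix = num_proxies * num_proxies
    write = num_tracks * num_proxies
    return read + mix + write


if __name__ == "__main__":
    # 70k tracks is taken from the abstract; 64 proxies is an illustrative guess.
    n, k = 70_000, 64
    full = full_attention_pairs(n)
    proxy = proxy_attention_pairs(n, k)
    print(f"full self-attention: {full:,} entries")
    print(f"via proxy tokens:    {proxy:,} entries")
    print(f"reduction factor:    {full / proxy:.0f}x")
```

Under these assumptions, the cost grows linearly with the number of tracks for a fixed number of proxies, instead of quadratically. For very small track counts the proxy route can cost more than direct self-attention, so the benefit only appears when the number of tracks is much larger than the number of proxies.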