Abstract: In this paper, we propose a simple and strong framework for Tracking Any Point with TRansformers (TAPTR). Based on the observation that point tracking bears a great resemblance to object detection and tracking, we borrow designs from DETR-like algorithms to address the task of TAP. In the proposed framework, in each video frame, each tracking point is represented as a point query, which consists of a positional part and a content part. As in DETR, each query (its position and content feature) is naturally updated layer by layer. Its visibility is predicted by its updated content feature. Queries belonging to the same tracking point can exchange information through self-attention along the temporal dimension. As all such operations are well-designed in DETR-like algorithms, the model is conceptually very simple. We also adopt some useful designs such as cost volume from optical flow models and develop simple designs to provide long temporal information while mitigating the feature drifting issue. Our framework demonstrates strong performance with state-of-the-art performance on various TAP datasets with faster inference speed.

What problem does this paper attempt to address?

The main goal of this paper is to propose a new framework for addressing the problem of tracking any specified point in a video (Tracking Any Point, TAP), especially when dealing with occlusion issues in long sequences. The authors argue that traditional optical flow estimation methods are ineffective in handling such problems because these methods mainly focus on the correspondence between consecutive frames and cannot effectively manage information in long sequences, particularly when the tracking point is occluded.

To address this issue, the paper proposes TAPTR (Tracking Any Point with Transformers), a simple yet powerful framework based on Transformers. Inspired by algorithms like DETR (DEtection TRansformer), this framework treats tracking points as queries, with each query comprising both position and content, and improves the representation of queries through multiple layers of updates. Additionally, the framework incorporates the concept of cost volume, borrowed from optical flow estimation methods, to enhance the model's ability to capture local features, thereby improving tracking accuracy.

The key contributions of the paper include:

1. **Framework Design**: A simple framework is designed that utilizes the Transformer architecture to track any point in a video, effectively handling occlusion issues and achieving excellent performance on multiple datasets.
2. **Integration of Cost Volume**: Cost volume is integrated into the Transformer decoder to provide initial visual similarity information, which helps improve the model's accuracy in locating tracking points.
3. **Long-term Information Processing**: By using a cross-frame self-attention mechanism, information is exchanged in the temporal dimension, better leveraging long-term contextual information.
4. **Window Post-processing**: To overcome memory limitations, a sliding window strategy is employed to process video sequences, and an update and padding strategy is proposed to maintain the consistency of tracking point information, mitigating the issue of feature drift.

Experimental results show that TAPTR achieves state-of-the-art performance on several challenging TAP datasets, particularly surpassing existing techniques on the DAVIS dataset, while maintaining faster inference speed.
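
To make the cost-volume idea more concrete, here is a minimal sketch in plain Python. It is not TAPTR's implementation: it only assumes that a cost volume can be read as the dot-product similarity between one point's content feature and the feature at every location of a frame. The names `cost_volume`, `best_match`, `features`, and `query_feature` are hypothetical and chosen for illustration, and the arg-max step is a crude stand-in for how similarity might hint at an initial position.

```python
from typing import List, Tuple

Vector = List[float]
FeatureMap = List[List[Vector]]  # indexed as [row][col][channel]


def cost_volume(query: Vector, feature_map: FeatureMap) -> List[List[float]]:
    """Dot-product similarity between one point feature and every location."""
    return [
        [sum(q * f for q, f in zip(query, cell)) for cell in row]
        for row in feature_map
    ]


def best_match(cost: List[List[float]]) -> Tuple[int, int]:
    """Return the (row, col) with the highest similarity (illustrative only)."""
    return max(
        ((r, c) for r in range(len(cost)) for c in range(len(cost[0]))),
        key=lambda rc: cost[rc[0]][rc[1]],
    )


# Toy 2x3 feature map with 2 channels per location (made-up values).
features = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    [[0.0, 0.0], [0.9, 0.1], [0.2, 0.8]],
]
query_feature = [0.0, 1.0]  # hypothetical content feature of the tracked point

cost = cost_volume(query_feature, features)
print(best_match(cost))  # -> (0, 1)
```

In this toy example the location whose feature matches the query most closely wins. In the framework described above, such similarity information is fed into the Transformer decoder instead of being used directly as the prediction.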

TAPTR: Tracking Any Point with Transformers as Detection

TAPTRv2: Attention-based Position Update Improves Tracking Any Point

TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video

Event-Based Tracking Any Point with Motion-Augmented Temporal Consistency

Exploring Point-BEV Fusion for 3D Point Cloud Object Tracking with Transformer

PTTR: Relational 3D Point Cloud Object Tracking with Transformer

Tracking Any Point with Frame-Event Fusion Network at High Frame Rate

AnchorPoint: Query Design for Transformer-Based 3D Object Detection and Tracking

Global Tracking Transformers

Self-Supervised Any-Point Tracking by Contrastive Random Walks

DETA: A Point-Based Tracker With Deformable Transformer and Task-Aligned Learning

Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking

Point Spatio-Temporal Transformer Networks for Point Cloud Video Modeling

FastTrackTr: Towards Fast Multi-Object Tracking with Transformers

PointTransformer: Encoding Human Local Features for Small Target Detection

Instance-Guided Point Cloud Single Object Tracking With Inception Transformer

BootsTAP: Bootstrapped Training for Tracking-Any-Point

Exploring Dynamic Transformer for Efficient Object Tracking

Learning Spatial-Frequency Transformer for Visual Object Tracking

3D Object Tracking with Transformer

Efficient transformer tracking with adaptive attention