TAPTRv2: Attention-based Position Update Improves Tracking Any Point

Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Feng Li, Tianhe Ren, Bohan Li, Lei Zhang

2024-07-23

Abstract: In this paper, we present TAPTRv2, a Transformer-based approach built upon TAPTR for solving the Tracking Any Point (TAP) task. TAPTR borrows designs from DEtection TRansformer (DETR) and formulates each tracking point as a point query, making it possible to leverage well-studied operations in DETR-like algorithms. TAPTRv2 improves TAPTR by addressing a critical issue regarding its reliance on cost-volume, which contaminates the point query's content feature and negatively impacts both visibility prediction and cost-volume computation. In TAPTRv2, we propose a novel attention-based position update (APU) operation and use key-aware deformable attention to realize it. For each query, this operation uses key-aware attention weights to combine the corresponding deformable sampling positions and predict a new query position. This design is based on the observation that local attention is essentially the same as cost-volume, both of which are computed by a dot product between a query and its surrounding features. By introducing this new operation, TAPTRv2 not only removes the extra burden of cost-volume computation, but also leads to a substantial performance improvement. TAPTRv2 surpasses TAPTR and achieves state-of-the-art performance on many challenging datasets, demonstrating the superiority
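The observation that local attention and cost-volume rest on the same dot products can be illustrated with a small sketch. The function names, the toy features, and the `sqrt(d)` scaling are assumptions made for illustration; they are not taken from the TAPTRv2 code.

```python
import math

def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def local_cost_volume(query, neighborhood):
    """Similarity between one query feature and each surrounding feature."""
    return [dot(query, key) for key in neighborhood]

def local_attention_weights(query, neighborhood):
    """Softmax over the same dot products, scaled by sqrt(d) (an assumption)."""
    scale = math.sqrt(len(query))
    logits = [c / scale for c in local_cost_volume(query, neighborhood)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: a 2-D query feature and three surrounding features.
query = [1.0, 0.0]
neighborhood = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
costs = local_cost_volume(query, neighborhood)
weights = local_attention_weights(query, neighborhood)
```

In this sketch the attention weights are a normalized, monotone transform of the cost values, so both rank the surrounding features in the same order.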

Computer Vision and Pattern Recognition, Robotics

What problem does this paper attempt to address?

The paper primarily focuses on improving algorithm performance in the Tracking Any Point (TAP) task, particularly for tracking specific points and predicting their visibility in videos. The paper proposes a new method called TAPTRv2, a Transformer-based approach aimed at addressing the issues present in its predecessor, TAPTR.

### Main Problems the Paper Attempts to Solve

1. **Cost-Volume Issue**: The TAPTR method relies on cost-volume features, which can contaminate the content features of point queries and negatively impact visibility prediction and cost-volume calculation.
2. **Feature Contamination Issue**: TAPTR simply concatenates cost-volume features with the content features of point queries, which not only complicates the model but also hinders optimization and learning efficiency.
3. **Need for a Simplified Framework**: Although TAPTR has shown performance improvements, it still retains the use of cost-volume, making it less concise compared to query-based object detection methods.

### Solutions

- **Attention-based Position Update (APU)**: The paper introduces a new APU operation to predict the new position of each query. This operation is based on the observation that local attention is essentially the same as cost-volume, both calculating the similarity between queries and their surrounding features through a dot product.
- **Key-Aware Deformable Attention**: To better utilize APU, the paper employs key-aware deformable attention. This attention mechanism explicitly compares queries with image features, thereby more accurately matching the design of APU.
- **Avoiding Direct Use of Cost-Volume**: TAPTRv2 no longer directly uses cost-volume, which avoids contamination of content features. Instead, it indirectly utilizes the information captured by cost-volume through the APU operation.

### Summary

By introducing the APU operation and adopting key-aware deformable attention, TAPTRv2 not only addresses the issues caused by cost-volume in TAPTR but also further simplifies the overall framework and improves algorithm performance. These improvements enable TAPTRv2 to achieve state-of-the-art results on multiple challenging datasets.
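A minimal sketch of how an attention-based position update could work is given below, assuming a single attention head, a single feature scale, and key features that have already been sampled at the deformable offsets. The function `attention_position_update`, its arguments, and the toy values are hypothetical; the actual TAPTRv2 operation may differ in detail.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention_position_update(query_feat, query_pos, offsets, sampled_keys):
    """Predict a new query position as an attention-weighted mean of
    the deformable sampling positions (hypothetical single-head sketch)."""
    # Sampling positions are the current position plus each offset.
    positions = [(query_pos[0] + dx, query_pos[1] + dy) for dx, dy in offsets]
    # Key-aware logits: compare the query with the feature at each sample.
    scale = math.sqrt(len(query_feat))
    logits = [sum(q * k for q, k in zip(query_feat, key)) / scale
              for key in sampled_keys]
    weights = softmax(logits)
    # The same weights that would aggregate content also combine positions.
    new_x = sum(w * p[0] for w, p in zip(weights, positions))
    new_y = sum(w * p[1] for w, p in zip(weights, positions))
    return (new_x, new_y), weights

# Toy example: only the sample to the right resembles the query.
query_feat = [1.0, 0.0]
query_pos = (10.0, 10.0)
offsets = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
sampled_keys = [[4.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
new_pos, weights = attention_position_update(
    query_feat, query_pos, offsets, sampled_keys)
```

In this toy case the updated position moves toward the best-matching sample along x and stays put along y. No explicit cost-volume tensor is built or concatenated into the query's content feature, which is the point the summary makes.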

TAPTR: Tracking Any Point with Transformers as Detection

TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video

Solution for Point Tracking Task of ICCV 1st Perception Test Challenge 2023

DETA: A Point-Based Tracker With Deformable Transformer and Task-Aligned Learning

Efficient transformer tracking with adaptive attention

Solution for Point Tracking Task of ECCV 2nd Perception Test Challenge 2024

Exploring Point-BEV Fusion for 3D Point Cloud Object Tracking with Transformer

BootsTAP: Bootstrapped Training for Tracking-Any-Point

Event-Based Tracking Any Point with Motion-Augmented Temporal Consistency

AnchorPoint: Query Design for Transformer-Based 3D Object Detection and Tracking

PTTR: Relational 3D Point Cloud Object Tracking with Transformer

Point Transformer V3: Simpler, Faster, Stronger

Target-Aware Tracking with Long-term Context Attention

Target-point Attention Transformer: A novel trajectory predict network for end-to-end autonomous driving

Self-Supervised Any-Point Tracking by Contrastive Random Walks

TAPVid-3D: A Benchmark for Tracking Any Point in 3D

Exploiting spatial relationships for visual tracking

AMTrack: Transformer tracking via action information and mix-frequency features

Learning Cross-Attention Point Transformer With Global Porous Sampling

Position-Guided Point Cloud Panoptic Segmentation Transformer