IP-MOT: Instance Prompt Learning for Cross-Domain Multi-Object Tracking

Run Luo, Zikai Song, Longze Chen, Yunshui Li, Min Yang, Wei Yang

2024-10-30

Abstract: Multi-Object Tracking (MOT) aims to associate multiple objects across video frames and is a challenging vision task due to inherent complexities in the tracking environment. Most existing approaches train and track within a single domain, resulting in a lack of cross-domain generalizability to data from other domains. While several works have introduced natural language representation to bridge the domain gap in visual tracking, these textual descriptions often provide too high-level a view and fail to distinguish various instances within the same class. In this paper, we address this limitation by developing IP-MOT, an end-to-end transformer model for MOT that operates without concrete textual descriptions. Our approach is underpinned by two key innovations: Firstly, leveraging a pre-trained vision-language model, we obtain instance-level pseudo textual descriptions via prompt-tuning, which are invariant across different tracking scenes; Secondly, we introduce a query-balanced strategy, augmented by knowledge distillation, to further boost the generalization capabilities of our model. Extensive experiments conducted on three widely used MOT benchmarks, including MOT17, MOT20, and DanceTrack, demonstrate that our approach not only achieves competitive performance on same-domain data compared to state-of-the-art models but also significantly improves the performance of query-based trackers by large margins for cross-domain inputs.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the cross-domain generalization problem in Multi-Object Tracking (MOT). Most existing methods train and track within a single domain, so they generalize poorly to data from other domains. Some works try to bridge the domain gap by introducing natural language representations, but these text descriptions are often too high-level to distinguish different instances of the same category. To overcome this limitation, the paper proposes IP-MOT (Instance Prompt Learning for Cross-Domain Multi-Object Tracking), a Transformer-based end-to-end model that performs multi-object tracking without the need for concrete text descriptions. The main innovations of this method are:

1. **Instance-level pseudo-text descriptions**: Using a pre-trained vision-language model (such as CLIP), instance-level pseudo-text descriptions that are invariant across different tracking scenes are obtained through prompt tuning.
2. **Query balancing strategy**: A query balancing strategy, augmented by knowledge distillation, is employed to further improve the model's generalization ability.

Experimental results on MOT17, MOT20, and DanceTrack show that IP-MOT is competitive with state-of-the-art models on same-domain data and significantly improves the performance of query-based trackers on cross-domain inputs.
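The summary mentions knowledge distillation without giving its form. As a rough illustration only, the sketch below shows the generic temperature-softened distillation loss, where a student's predictions are pulled toward a teacher's. It is not the loss used in IP-MOT: the function names, the choice of KL divergence, and the temperature value are all assumptions made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic KD loss (assumed form, not the paper's): KL(teacher || student)
    on temperature-softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)  # teacher target distribution
    q = softmax(student_logits, temperature)  # student prediction
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


if __name__ == "__main__":
    # Identical logits give zero loss; disagreement gives a positive loss.
    print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

A higher temperature flattens both distributions, so the student is also penalized for mismatches on low-probability entries rather than only on the top prediction.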

Exploit the Connectivity: Multi-Object Tracking with TrackletNet

Split and Connect: A Universal Tracklet Booster for Multi-Object Tracking

Generalizing Multiple Object Tracking to Unseen Domains by Introducing Natural Language Representation

MAT: Motion-Aware Multi-Object Tracking

Multi-Granularity Language-Guided Multi-Object Tracking

Multiple Object Tracking as ID Prediction

Object-Level Pseudo-3D Lifting for Distance-Aware Tracking

LaMOT: Language-Guided Multi-Object Tracking

Part‐MOT: A Multi‐object Tracking Method with Instance Part‐based Embedding

IA-MOT: Instance-Aware Multi-Object Tracking with Motion Consistency

UTOPIA: Unconstrained Tracking Objects without Preliminary Examination via Cross-Domain Adaptation

MotionTrack: Learning Robust Short-term and Long-term Motions for Multi-Object Tracking

CAMO-MOT: Combined Appearance-Motion Optimization for 3D Multi-Object Tracking With Camera-LiDAR Fusion

TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking

Multi-object Tracking via Discriminative Embeddings for the Internet of Things

DIOR - DIstill Observations to Representations for Multi-Object Tracking and Segmentation

MOTR: End-to-End Multiple-Object Tracking with Transformer

Simultaneous Detection and Tracking with Motion Modelling for Multiple Object Tracking