IP-MOT: Instance Prompt Learning for Cross-Domain Multi-Object Tracking

Run Luo, Zikai Song, Longze Chen, Yunshui Li, Min Yang, Wei Yang
2024-10-30
Abstract: Multi-Object Tracking (MOT) aims to associate multiple objects across video frames and is a challenging vision task due to inherent complexities in the tracking environment. Most existing approaches train and track within a single domain, which limits their generalizability to data from other domains. While several works have introduced natural language representations to bridge the domain gap in visual tracking, these textual descriptions often provide too high-level a view and fail to distinguish individual instances within the same class. In this paper, we address this limitation by developing IP-MOT, an end-to-end transformer model for MOT that operates without concrete textual descriptions. Our approach is underpinned by two key innovations: first, leveraging a pre-trained vision-language model, we obtain instance-level pseudo textual descriptions via prompt-tuning, which are invariant across different tracking scenes; second, we introduce a query-balanced strategy, augmented by knowledge distillation, to further boost the generalization capabilities of our model. Extensive experiments on three widely used MOT benchmarks (MOT17, MOT20, and DanceTrack) demonstrate that our approach not only achieves competitive performance on same-domain data compared to state-of-the-art models but also improves the performance of query-based trackers by large margins on cross-domain inputs.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the cross-domain generalization problem in Multi-Object Tracking (MOT). Most existing methods are trained and evaluated within a single domain, so they generalize poorly to data from other domains. Some works try to bridge the domain gap by introducing natural language representations, but these text descriptions are often too high-level to distinguish different instances of the same category. To overcome this limitation, the paper proposes IP-MOT (Instance Prompt Learning for Cross-Domain Multi-Object Tracking), a Transformer-based end-to-end model that performs multi-object tracking without requiring concrete text descriptions. The main innovations of the method are:

1. **Instance-level pseudo-text descriptions**: Using a pre-trained vision-language model (such as CLIP), prompt tuning produces instance-level pseudo-text descriptions that are invariant across different tracking scenes.
2. **Query-balanced strategy**: A query-balanced strategy, augmented by knowledge distillation, further improves the model's generalization ability.

Experiments on MOT17, MOT20, and DanceTrack show that IP-MOT is competitive with state-of-the-art models on same-domain data and improves the performance of query-based trackers by large margins on cross-domain inputs.
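To illustrate the knowledge distillation component mentioned above, here is a minimal Python sketch of a generic temperature-softened distillation loss. It is not taken from the paper. The function names, the temperature value, and the use of KL divergence over per-query logits are all assumptions made for illustration, and IP-MOT's actual distillation objective may differ.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of a list of floats (hypothetical helper)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: a generic soft-target loss, scaled by temperature**2 as is
    conventional; this is not the loss defined in the IP-MOT paper.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (target)
    log_q = log_softmax(student_logits, temperature)  # student (prediction)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [3.0, 2.0, 1.0]
    # Identical logits give zero loss; mismatched logits give a positive loss.
    print(distillation_loss(teacher, teacher))
    print(distillation_loss([1.0, 2.0, 3.0], teacher))
```

In this sketch the teacher's softened distribution is the target, so the loss is zero when the student reproduces the teacher's logits (up to a constant shift) and grows as the two distributions diverge.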