Associate Everything Detected: Facilitating Tracking-by-Detection to the Unknown

Zimeng Fang, Chao Liang, Xue Zhou, Shuyuan Zhu, Xi Li
2024-09-14
Abstract: Multi-object tracking (MOT) emerges as a pivotal and highly promising branch in the field of computer vision. Classical closed-vocabulary MOT (CV-MOT) methods aim to track objects of predefined categories. Recently, some open-vocabulary MOT (OV-MOT) methods have successfully addressed the problem of tracking unknown categories. However, we found that the CV-MOT and OV-MOT methods each struggle to excel in the tasks of the other. In this paper, we present a unified framework, Associate Everything Detected (AED), that simultaneously tackles CV-MOT and OV-MOT by integrating with any off-the-shelf detector and supports unknown categories. Different from existing tracking-by-detection MOT methods, AED gets rid of prior knowledge (e.g., motion cues) and relies solely on highly robust feature learning to handle complex trajectories in OV-MOT tasks while keeping excellent performance in CV-MOT tasks. Specifically, we model the association task as a similarity decoding problem and propose a sim-decoder with an association-centric learning mechanism. The sim-decoder calculates similarities in three aspects: spatial, temporal, and cross-clip. Subsequently, association-centric learning leverages these threefold similarities to ensure that the extracted features are appropriate for continuous tracking and robust enough to generalize to unknown categories. Compared with existing powerful OV-MOT and CV-MOT methods, AED achieves superior performance on TAO, SportsMOT, and DanceTrack without any prior knowledge. Our code is available at https://github.com/balabooooo/AED.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two key problems in multi-object tracking (MOT):

1. **The gap between closed-vocabulary MOT (CV-MOT) and open-vocabulary MOT (OV-MOT)**:
   - Closed-vocabulary MOT methods track only objects of predefined categories, such as people and cars. They perform well on known categories but poorly on unknown ones.
   - Open-vocabulary MOT methods adapt to a wider range of categories, including those not seen during training, but on certain specific categories they track less well than fine-tuned closed-vocabulary MOT methods.
2. **The dependence of existing MOT methods on prior knowledge**:
   - Existing MOT methods usually rely on motion cues or other prior knowledge for object association, which can degrade performance on complex motion patterns or unknown categories.

To solve these problems, the authors propose a unified framework, **Associate Everything Detected (AED)**. AED addresses them in the following ways:

- **Unifying CV-MOT and OV-MOT tasks**: AED handles both closed-vocabulary and open-vocabulary MOT tasks within the same framework and supports tracking of unknown categories.
- **Reducing the dependence on prior knowledge**: AED relies solely on robust feature learning to handle complex trajectories, without prior knowledge such as motion cues.
- **Introducing an association-centric learning mechanism**: To ensure that the extracted features are suitable for continuous tracking and generalize to unknown categories, AED pairs a sim-decoder with association-centric learning. The sim-decoder calculates similarities in three aspects: spatial, temporal, and cross-clip.

Specifically, AED models the association task as a similarity decoding problem and uses the sim-decoder to calculate the similarity between object queries and trajectory queries. In addition, AED uses contrastive learning during training to strengthen the model's spatio-temporal consistency and long-term ID consistency, thereby improving tracking performance.

In summary, AED aims to provide a general and robust multi-object tracking solution that handles both known and unknown categories and reduces the dependence on prior knowledge.
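
To make the idea of association by feature similarity alone more concrete, here is a minimal Python sketch. It is not the AED implementation: the learned sim-decoder is replaced with plain cosine similarity, and the matching step is a simple greedy assignment. The function names `cosine_similarity` and `associate` and the `sim_threshold` parameter are hypothetical, chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def associate(track_feats, det_feats, sim_threshold=0.5):
    """Greedily match tracks to detections using appearance similarity only.

    Assumption: cosine similarity stands in for a learned similarity score,
    and no motion cue (e.g. a Kalman filter or IoU gate) is used.
    Returns (matches, unmatched_tracks, new_detections), where matches maps
    a track index to a detection index.
    """
    # Collect every track/detection pair whose similarity clears the threshold.
    candidates = []
    for t, track_feat in enumerate(track_feats):
        for d, det_feat in enumerate(det_feats):
            sim = cosine_similarity(track_feat, det_feat)
            if sim >= sim_threshold:
                candidates.append((sim, t, d))

    # Take the most similar pairs first, keeping the matching one-to-one.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    matches = {}
    used_dets = set()
    for sim, t, d in candidates:
        if t in matches or d in used_dets:
            continue
        matches[t] = d
        used_dets.add(d)

    unmatched_tracks = [t for t in range(len(track_feats)) if t not in matches]
    new_detections = [d for d in range(len(det_feats)) if d not in used_dets]
    return matches, unmatched_tracks, new_detections


# Two existing tracks and three detections in the current frame.
tracks = [[1.0, 0.0], [0.0, 1.0]]
detections = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
matches, lost, new = associate(tracks, detections)
print(matches, lost, new)
```

In this toy example each track is matched to the detection with the closest feature, and the third detection, which resembles neither track, is left over as a candidate for a new track. A globally optimal assignment (for example, SciPy's `linear_sum_assignment`) could replace the greedy loop; which matcher AED itself uses is not stated in the text above.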