Abstract: We propose a new visual hierarchical representation paradigm for multi-object tracking. It is more effective to discriminate between objects by attending to objects' compositional visual regions and contrasting with the background contextual information instead of sticking to only the semantic visual cue such as bounding boxes. This compositional-semantic-contextual hierarchy is flexible to be integrated in different appearance-based multi-object tracking methods. We also propose an attention-based visual feature module to fuse the hierarchical visual representations. The proposed method achieves state-of-the-art accuracy and time efficiency among query-based methods on multiple multi-object tracking benchmarks.
What problem does this paper attempt to address?
This paper addresses the problem of object discrimination in multi-object tracking (MOT). Specifically, the authors propose a new visual hierarchical representation that improves discrimination between objects by fusing visual information from different spatial regions. Traditional multi-object tracking methods rely mainly on semantic visual cues such as bounding boxes, an approach that is prone to mismatches when objects have similar appearances. To overcome this problem, the paper proposes a visual representation framework with three levels (compositional, semantic, and contextual) in order to distinguish different objects more effectively.
### Main contributions of the paper:
1. **Visual hierarchical representation**: A new method for generating more discriminative visual representations without additional annotations is proposed. This method enhances the representation ability of objects by extracting features from different components of the object and the background context.
2. **Attention mechanism module**: An attention-based module (CSC-Attention) is designed to fuse the features of these three levels, thereby further improving the accuracy of object discrimination.
3. **Transformer-based tracker**: A Transformer-based multi-object tracker (CSC-Tracker) is constructed. This tracker builds on the two contributions above and achieves state-of-the-art accuracy and time efficiency among query-based methods on multiple multi-object tracking benchmarks.
### Method overview:
- **Overall architecture**: CSC-Tracker adopts the spatio-temporal global association paradigm, which consists of three stages: detection and feature extraction, generation of feature tokens by the CSC-Attention module, and global association.
- **CSC-Attention module**: Through self-attention and cross-attention mechanisms, the features of the three levels (compositional, semantic, and contextual) are fused to generate the final feature tokens.
- **Training and inference**: During training, the association probability of detections belonging to the same trajectory is maximized, and a triplet loss is introduced to increase the feature distance between positive and negative samples. During inference, online tracking is carried out in a sliding-window manner.
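
The training objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that association probabilities come from a softmax over dot-product similarities between feature tokens, that the triplet loss uses Euclidean distance with a margin of 0.3, and that the two terms are simply added. The function names and the toy feature vectors are hypothetical.

```python
import math


def dot(u, v):
    """Dot-product similarity between two feature tokens."""
    return sum(a * b for a, b in zip(u, v))


def euclidean(u, v):
    """Euclidean distance between two feature tokens."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def association_nll(query, candidates, target_index):
    """Negative log-probability that `query` is associated with
    candidates[target_index].

    Assumption: probabilities are a softmax over dot-product scores.
    Minimizing this value maximizes the association probability.
    """
    scores = [dot(query, c) for c in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_sum_exp - scores[target_index]


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Standard triplet loss: zero once the negative is at least `margin`
    farther from the anchor than the positive is."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Toy feature tokens (hypothetical values, not taken from the paper).
anchor = [1.0, 0.0]        # detection in the current frame
same_track = [0.9, 0.1]    # detection from the same trajectory
other_track = [0.0, 1.0]   # detection from a different trajectory

nll = association_nll(anchor, [same_track, other_track], target_index=0)
trip = triplet_loss(anchor, same_track, other_track)
total_loss = nll + trip    # assumed unweighted sum of the two terms
print(f"association NLL = {nll:.4f}, triplet = {trip:.4f}")
```

In this toy case the triplet term is already zero because the two tracks are well separated, so only the association term contributes. For objects with similar appearances, the triplet term would stay positive and push their feature tokens apart.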
### Experimental results:
- **Benchmark tests**: Experiments were carried out on multiple datasets, including MOT17, MOT20, and DanceTrack. CSC-Tracker achieved state-of-the-art performance on multiple metrics, with particularly strong results on HOTA and AssA.
- **Ablation study**: By comparing the effects of factors such as different video clip lengths, input image sizes, and detector selections on performance, the effectiveness and robustness of the proposed method were verified.
### Conclusion:
The CSC-Tracker proposed in this paper significantly improves object discrimination in multi-object tracking by introducing a new visual hierarchical representation and an attention-based fusion module, especially when dealing with objects with similar appearances. The method achieves state-of-the-art accuracy among query-based methods and is also more time-efficient.