Abstract: We propose a new visual hierarchical representation paradigm for multi-object tracking. It is more effective to discriminate between objects by attending to objects' compositional visual regions and contrasting with the background contextual information instead of sticking to only the semantic visual cue such as bounding boxes. This compositional-semantic-contextual hierarchy is flexible to be integrated in different appearance-based multi-object tracking methods. We also propose an attention-based visual feature module to fuse the hierarchical visual representations. The proposed method achieves state-of-the-art accuracy and time efficiency among query-based methods on multiple multi-object tracking benchmarks.

What problem does this paper attempt to address?

This paper addresses the problem of object discrimination in multi-object tracking (MOT). Specifically, the authors propose a new hierarchical visual representation that improves discrimination between objects by fusing visual information from different spatial regions. Traditional multi-object tracking methods rely mainly on semantic visual cues such as bounding boxes, an approach that is prone to mismatching when objects have similar appearances. To overcome this problem, the paper proposes a visual representation framework with three levels (compositional, semantic, and contextual) to distinguish different objects more effectively.

### Main contributions of the paper:

1. **Visual hierarchical representation**: A new method for generating more discriminative visual representations without additional annotations. It strengthens object representations by extracting features from different parts of the object and from the background context.
2. **Attention mechanism module**: An attention-based module (CSC-Attention) that fuses the features of these three levels, further improving the accuracy of object discrimination.
3. **Transformer-based tracker**: A Transformer-based multi-object tracker (CSC-Tracker) built on the two contributions above, which achieves state-of-the-art accuracy and time efficiency on multiple multi-object tracking benchmarks.

### Method overview:

- **Overall architecture**: CSC-Tracker adopts the spatio-temporal global association paradigm, which has three main stages: detection and feature extraction, generation of feature tokens by the CSC-Attention module, and global association.
- **CSC-Attention module**: Self-attention and cross-attention fuse the compositional, semantic, and contextual features to generate the final feature tokens.
- **Training and inference**: During training, the association probability of detections belonging to the same trajectory is maximized, and a triplet loss is added to increase the feature distance between positive and negative samples. During inference, online tracking is carried out in a sliding-window manner.

### Experimental results:

- **Benchmark tests**: Experiments were carried out on multiple datasets, including MOT17, MOT20, and DanceTrack. CSC-Tracker achieved state-of-the-art performance on multiple metrics, most notably HOTA and AssA.
- **Ablation study**: The effectiveness and robustness of the proposed method were verified by comparing how video clip length, input image size, and detector choice affect performance.

### Conclusion:

The CSC-Tracker proposed in this paper significantly improves object discrimination in multi-object tracking by introducing a new hierarchical visual representation and an attention mechanism, especially when dealing with objects of similar appearance. The method reaches a new level of performance and is also more cost-effective in its computational resource requirements.
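
The training step in the method overview mentions a triplet loss that pushes features of different trajectories apart. As a rough illustration of that idea only, the sketch below implements the standard hinge-style triplet loss in plain Python. The Euclidean distance, the margin of 1.0, the function names, and the toy 3-D feature vectors are all assumptions made for this example; the summary does not state which distance, margin, or sampling strategy the paper uses.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    The distance and the margin value are assumptions for illustration.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical feature tokens (not taken from the paper).
anchor = [1.0, 0.0, 0.0]         # a detection from trajectory A
positive = [0.9, 0.1, 0.0]       # another detection from trajectory A
easy_negative = [0.0, 1.0, 0.0]  # a clearly different object
hard_negative = [0.8, 0.2, 0.0]  # a similar-looking object

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
print(f"easy negative loss: {easy:.4f}")
print(f"hard negative loss: {hard:.4f}")
```

In this toy setup the easy negative already sits beyond the margin and contributes no loss, while the similar-looking negative produces a positive loss. That matches the case the summary highlights: objects with similar appearance are the ones the representation has to learn to separate.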

Multi-Object Tracking by Hierarchical Visual Representations

Robust Object Tracking with a Hierarchical Ensemble Framework

Multi-View People Tracking Via Hierarchical Trajectory Composition

Visual Tracking Based on Hierarchical Framework and Sparse Representation

Multi-level Visual Tracking with Hierarchical Tree Structural Constraint

Exploiting Hierarchical Dense Structures on Hypergraphs for Multi-Object Tracking

Multi-target Tracking with Hierarchical Data Association Using Main-Parts and Spatial-Temporal Feature Models

Multi Object Tracking Based on Detection with Deep Learning and Hierarchical Clustering

Multi-features Guided Robust Visual Tracking

Robust Tracking via Multi-level Multi-feature Templates

Multi-Task Hierarchical Feature Learning for Real-Time Visual Tracking

Visual tracking based on multi-cue framework and hierarchical aggregation

Motion-guided and Occlusion-Aware Multi-Object Tracking with Hierarchical Matching

Multi-Object Tracking Hierarchically in Visual Data Taken From Drones

Scene-Adaptive Hierarchical Data Association for Multiple Objects Tracking

Object Tracking with Hierarchical Multiview Learning

Robust Visual Tracking Via Hierarchical Convolutional Features

Multi-object tracking via discriminative appearance modeling

Multi-object Tracking by Expanding Long-Tracklets

Multi-invariance appearance model for object tracking

Multi-hierarchical Independent Correlation Filters for Visual Tracking