Abstract: Object detection has long been a topic of high interest in computer vision literature. Motivated by the fact that annotating data for the multi-object tracking (MOT) problem is immensely expensive, recent studies have turned their attention to the unsupervised learning setting. In this paper, we push forward the state-of-the-art performance of unsupervised MOT methods by proposing UnsMOT, a novel framework that explicitly combines the appearance and motion features of objects with geometric information to provide more accurate tracking. Specifically, we first extract the appearance and motion features using CNN and RNN models, respectively. Then, we construct a graph of objects based on their relative distances in a frame, which is fed into a GNN model together with CNN features to output geometric embedding of objects optimized using an unsupervised loss function. Finally, associations between objects are found by matching not only similar extracted features but also geometric embedding of detections and tracklets. Experimental results show remarkable performance in terms of HOTA, IDF1, and MOTA metrics in comparison with state-of-the-art methods.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the high cost and time required to label data for multi-object tracking (MOT). Specifically, it proposes a framework named UnsMOT, which combines the appearance features, motion features, and geometric topological information of objects to achieve more accurate object tracking without the need for labeled data. The main contributions of the paper are:

1. **Unsupervised learning**: The paper proposes an unsupervised learning method that avoids the need for a large amount of labeled data, reducing annotation cost and time.
2. **Combination of multiple features**: The UnsMOT framework uses not only the appearance and motion features of objects but also geometric topological information, modeling the relative positions of objects with a graph neural network (GNN).
3. **Performance improvement**: Experimental results show that UnsMOT significantly outperforms existing unsupervised methods on multiple evaluation metrics (such as HOTA, IDF1, and MOTA), and in some cases even outperforms supervised methods.

### Main methods of the paper

1. **Feature extraction**:
   - **Appearance features**: A convolutional neural network (CNN) extracts appearance features from the detected object images.
   - **Motion features**: A recurrent neural network (RNN) extracts motion features from the object's bounding boxes and the RNN hidden state of the previous frame.
2. **Graph construction**:
   - A graph is built from the relative distances of objects in the current frame. Each node represents a detected object, and the edge weights are determined by the Euclidean distances between objects.
3. **Graph neural network (GNN)**:
   - The constructed graph and the appearance features extracted by the CNN are fed into the GNN model to obtain a geometric embedding of each object.
   - The GNN model is optimized with an unsupervised loss function so that it learns the topological relationships between objects.
4. **Object association**:
   - Objects are associated by matching the features (appearance features, motion features, and geometric embeddings) of detected objects against those of historical trajectories.
   - The different similarity scores are combined by weighted summation to obtain the object association matrix.

### Experimental results

The paper reports extensive experiments on three datasets: MOT16, MOT17, and MOT20. The results show that UnsMOT performs strongly on the HOTA, IDF1, and MOTA metrics, significantly outperforms other unsupervised methods, and in some cases even outperforms supervised methods.

### Conclusion

By combining the appearance features, motion features, and geometric topological information of objects, the UnsMOT framework achieves a significant performance improvement in unsupervised multi-object tracking and removes the dependence on expensive, time-consuming data labeling.
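As a rough illustration of the graph-construction and object-association steps summarized under "Main methods of the paper", the following is a minimal sketch and not the authors' implementation. It assumes that edge weights are the raw Euclidean distances between object centers (the summary does not say how distances are mapped to weights), that the similarity matrices have tracklets as rows and detections as columns, and that a simple greedy matcher is an acceptable stand-in for whatever assignment procedure the paper uses. The function names, the weights `(0.5, 0.3, 0.2)`, and the `min_score` threshold are hypothetical.

```python
import math


def pairwise_distance_graph(centers):
    """Dense edge-weight matrix: entry [i][j] is the Euclidean distance
    between object centers i and j (assumed weighting, 0 on the diagonal)."""
    return [[math.dist(a, b) for b in centers] for a in centers]


def combine_similarities(matrices, weights):
    """Weighted sum of same-shaped similarity matrices
    (rows: tracklets, columns: detections)."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(w * m[i][j] for m, w in zip(matrices, weights)) for j in range(cols)]
        for i in range(rows)
    ]


def greedy_match(score, min_score=0.5):
    """Greedy one-to-one matching: repeatedly accept the highest remaining
    score. This is a stand-in matcher, not the paper's procedure."""
    candidates = sorted(
        ((score[i][j], i, j) for i in range(len(score)) for j in range(len(score[0]))),
        reverse=True,
    )
    used_rows, used_cols, matches = set(), set(), []
    for s, i, j in candidates:
        if s < min_score:  # hypothetical cutoff for accepting a match
            break
        if i not in used_rows and j not in used_cols:
            matches.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return matches


# Two detected objects in one frame, given by their box centers.
graph = pairwise_distance_graph([(0.0, 0.0), (3.0, 4.0)])

# Toy similarity scores between 2 tracklets and 2 detections.
appearance = [[0.9, 0.2], [0.1, 0.8]]
motion = [[0.8, 0.3], [0.2, 0.7]]
geometric = [[0.7, 0.4], [0.3, 0.9]]

combined = combine_similarities([appearance, motion, geometric], (0.5, 0.3, 0.2))
matches = greedy_match(combined)
print(graph)
print(matches)
```

In this toy example, tracklet 0 pairs with detection 0 and tracklet 1 with detection 1, because those pairs have the highest combined scores. In a real tracker the three similarity matrices would come from the CNN, RNN, and GNN outputs, and the weights would need tuning.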

UnsMOT: Unified Framework for Unsupervised Multi-Object Tracking with Geometric Topology Guidance
