Abstract: Multi-object tracking (MOT) is essential to human surveillance, sports analytics, autonomous driving, and cooperative robots. Current MOT methods perform poorly under non-uniform motion, occlusion, and appearance-reappearance scenarios. We introduce a comprehensive MOT method that merges object detection and identity linkage within an end-to-end trainable framework and is designed to maintain object links over long periods of time. Our proposed model, named STMMOT, is built around four key modules: (1) a candidate proposal creation network, which generates object proposals via a vision-Transformer encoder-decoder architecture; (2) a scale variant pyramid, a progressive pyramid structure that learns the self-scale and cross-scale similarities in multi-scale feature maps; (3) a spatio-temporal memory encoder, which extracts the essential information from the memory associated with each tracked object; and (4) a spatio-temporal memory decoder, which simultaneously resolves the tasks of object detection and identity association for MOT. Our system leverages a robust spatio-temporal memory module that retains extensive historical object state observations and effectively encodes them using an attention-based aggregator. The uniqueness of STMMOT resides in representing objects as dynamic query embeddings that are updated continuously, which enables the prediction of object states with an attention mechanism and eliminates the need for post-processing. Experimental results show that STMMOT achieves scores of 79.8 and 78.4 for IDF1, 79.3 and 74.1 for MOTA, 73.2 and 69.0 for HOTA, and 61.2 and 61.5 for AssA, with ID switch counts of 1529 and 1264, on MOT17 and MOT20, respectively. Compared with the previous best, TransMOT, STMMOT achieves increases of around 4.58% and 4.25% in IDF1 and reduces ID switches by 5.79% and 21.05% on MOT17 and MOT20, respectively.
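The abstract describes a per-object memory of historical state observations that is summarized by an attention-based aggregator. The sketch below is one minimal way to picture that idea; it is not the STMMOT implementation. The class name `TrackMemory`, the `max_len` parameter, and the use of single-head scaled dot-product attention without learned projections are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TrackMemory:
    """Hypothetical fixed-length buffer of past state embeddings for one object."""

    def __init__(self, max_len=24):
        # Oldest observations are dropped automatically once max_len is reached.
        self.buffer = deque(maxlen=max_len)

    def write(self, embedding):
        """Store the object's embedding from the current frame."""
        self.buffer.append(list(embedding))

    def aggregate(self, query):
        """Summarize the memory with scaled dot-product attention.

        Assumption: the query attends directly to the stored embeddings,
        which act as both keys and values (no learned projections).
        """
        if not self.buffer:
            return list(query)  # nothing stored yet: fall back to the query
        scale = math.sqrt(len(query))
        scores = [
            sum(q * k for q, k in zip(query, mem)) / scale
            for mem in self.buffer
        ]
        weights = softmax(scores)
        return [
            sum(w * mem[i] for w, mem in zip(weights, self.buffer))
            for i in range(len(query))
        ]


# Usage: keep the last three observations and summarize them for a query.
memory = TrackMemory(max_len=3)
for emb in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]):
    memory.write(emb)
summary = memory.aggregate([1.0, 0.0])
```

Because the attention weights are non-negative and sum to one, the summary is a convex combination of the stored embeddings, weighted toward those most similar to the query. A trained model would typically add learned query, key, and value projections and multiple heads.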

STMT: Spatio-temporal memory transformer for multi-object tracking

RASTMTrack: Robust and Adaptive Space-Time Memory Networks for Visual Tracking

Exploit Spatiotemporal Contextual Information for 3D Single Object Tracking Via Memory Networks

TransLink: Transformer-Based Embedding for Tracklets’ Global Link

Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking

Exploit the Connectivity: Multi-Object Tracking with TrackletNet

MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking

Split and Connect: A Universal Tracklet Booster for Multi-Object Tracking

STMMOT: Advancing multi-object tracking through spatiotemporal memory networks and multi-scale attention pyramids

Transformer Network for Multi-Person Tracking and Re-Identification in Unconstrained Environment

MAT: Motion-Aware Multi-Object Tracking

MOTR: End-to-End Multiple-Object Tracking with Transformer

TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking

InterTrack: Interaction Transformer for 3D Multi-Object Tracking

STURE: Spatial-Temporal Mutual Representation Learning for Robust Data Association in Online Multi-Object Tracking

PuTR: A Pure Transformer for Decoupled and Online Multi-Object Tracking

ETTrack: Enhanced Temporal Motion Predictor for Multi-Object Tracking

STT: Stateful Tracking with Transformers for Autonomous Driving

FastTrackTr: Towards Fast Multi-Object Tracking with Transformers

STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking