Abstract: Multi-object tracking (MOT) has important applications in a variety of fields, including surveillance, sports analytics, self-driving, and cooperative robotics. Despite considerable advances, existing MOT methods tend to falter when faced with non-uniform motion, occlusion, and objects that disappear and later reappear. To address this shortcoming, we put forward an integrated MOT method that not only combines object detection and identity linkage within a single, end-to-end trainable framework but also equips the model to maintain object identity links over long periods of time. Our proposed model, named STMMOT, is built around four key modules: 1) candidate proposal generation, which generates object proposals via a vision-transformer encoder-decoder architecture that detects objects in each frame of the video; 2) scale variant pyramid, a progressive pyramid structure that learns the self-scale and cross-scale similarities in multi-scale feature maps; 3) spatio-temporal memory encoder, which extracts the essential information from the memory associated with each tracked object; and 4) spatio-temporal memory decoder, which simultaneously resolves the tasks of object detection and identity association for MOT. Our system leverages a robust spatio-temporal memory module that retains extensive historical observations and effectively encodes them using an attention-based aggregator. The uniqueness of STMMOT lies in representing objects as dynamic query embeddings that are updated continuously, which enables the prediction of object states with attention mechanisms and eliminates the need for post-processing.
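
The abstract describes a per-object memory that is summarized by an attention-based aggregator. The following is a minimal sketch of that general idea only, not the authors' implementation: `TrackMemory`, `aggregate_memory`, the buffer length, and the toy embeddings are all hypothetical names and values chosen for illustration, and plain scaled dot-product attention stands in for whatever aggregator STMMOT actually uses.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_memory(query, memory):
    """Scaled dot-product attention of one query over stored embeddings.

    query:  list of d floats (the current embedding of a tracked object)
    memory: list of T embeddings, each a list of d floats
    Returns (context, weights): the attention-weighted sum of the memory
    entries and the T attention weights.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, entry)) / math.sqrt(d)
        for entry in memory
    ]
    weights = softmax(scores)
    context = [
        sum(w * entry[i] for w, entry in zip(weights, memory))
        for i in range(d)
    ]
    return context, weights


class TrackMemory:
    """Hypothetical fixed-length history buffer for a single tracked object."""

    def __init__(self, max_len=8):
        # deque with maxlen drops the oldest observation automatically
        self.buffer = deque(maxlen=max_len)

    def update(self, embedding):
        self.buffer.append(list(embedding))

    def summary(self, query):
        return aggregate_memory(query, list(self.buffer))


# Toy usage: five observations, of which only the last three are retained.
memory = TrackMemory(max_len=3)
for t in range(5):
    memory.update([float(t), 1.0])
context, weights = memory.summary([1.0, 0.0])
```

In this sketch the aggregated `context` vector would play the role of the continuously updated query embedding for the next frame; how STMMOT actually updates its queries is not specified in the abstract.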

What problem does this paper attempt to address?

This paper attempts to address several key issues in multi-object tracking (MOT), particularly in the tasks of multi-person tracking and re-identification in unconstrained environments. Specifically, the paper focuses on the following aspects:

1. **Non-uniform motion, occlusion, and object reappearance**: Existing MOT methods perform poorly in handling complex scenarios such as non-uniform motion, occlusion, and object reappearance. These factors can lead to decreased tracking accuracy and even loss of targets.
2. **Separation of detection and identity association**: Traditional MOT methods usually separate object detection and identity association into two independent stages, which leads to inefficiency and performance loss.
3. **Long-term identity maintenance**: Maintaining the identity link of objects over long-term tracking is a challenge. Existing methods often suffer from identity switch errors during long-term tracking.

To address these issues, the paper proposes a new integrated MOT method, STMMOT (SpatioTemporal Multi-Object Tracking). This method is implemented through the following four key Transformer-based modules:

1. **Candidate proposal generation**: Using a visual Transformer encoder-decoder architecture to detect objects from each frame of the video.
2. **Scale variation pyramid**: Learning self-scale and cross-scale similarities in multi-scale feature maps.
3. **Spatiotemporal memory encoder**: Extracting key information from the memory associated with each tracked object.
4. **Spatiotemporal memory decoder**: Simultaneously addressing object detection and identity association tasks.

The uniqueness of STMMOT lies in representing objects as dynamic query embeddings and continuously updating these embeddings through the attention mechanism, thereby predicting object states and eliminating the need for post-processing.

Experimental results show that STMMOT achieves significant performance improvements on the MOT17 and MOT20 datasets, especially in metrics such as IDF1, MOTA, HOTA, and AssA, demonstrating a clear advantage over the previous best method, TransMOT.
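
For readers unfamiliar with the metrics named above, two of them have simple closed forms: MOTA is one minus the ratio of all errors (missed targets, false positives, identity switches) to the number of ground-truth objects, and IDF1 is the F1 score over identity-matched detections. The sketch below computes both from raw counts. The function names and the example counts are made up for illustration and are not results from the paper; HOTA and AssA require a per-threshold association analysis and are not covered here.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


# Hypothetical counts, chosen only to show the arithmetic.
example_mota = mota(false_negatives=100, false_positives=50,
                    id_switches=10, num_gt=1000)
example_idf1 = idf1(idtp=800, idfp=100, idfn=100)
```

Because MOTA counts an identity switch as a single error among many, it mostly reflects detection quality, whereas IDF1 penalizes a track for every frame in which it carries the wrong identity. That is why long-term identity maintenance, the focus of this paper, is better reflected in IDF1 and the association metrics.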

Transformer Network for Multi-Person Tracking and Re-Identification in Unconstrained Environment

TransLink: Transformer-Based Embedding for Tracklets’ Global Link

Beyond Traditional Driving Scenes: A Robotic-Centric Paradigm for 2D+3D Human Tracking Using Siamese Transformer Network

Exploit the Connectivity: Multi-Object Tracking with TrackletNet

Multi-modal 3D Human Tracking for Robots in Complex Environment with Siamese Point-Video Transformer

STMT: Spatio-temporal memory transformer for multi-object tracking

MAT: Motion-Aware Multi-Object Tracking

Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking

MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking

InterTrack: Interaction Transformer for 3D Multi-Object Tracking

TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking

TransCenter: Transformers With Dense Representations for Multiple-Object Tracking

MOTR: End-to-End Multiple-Object Tracking with Transformer

TrackFormer: Multi-Object Tracking with Transformers

Multi-object tracking algorithm based on interactive attention network and adaptive trajectory reconnection

Multiple Object Tracking as ID Prediction

PuTR: A Pure Transformer for Decoupled and Online Multi-Object Tracking

A transformer‐based lightweight method for multiple‐object tracking

FastTrackTr: Towards Fast Multi-Object Tracking with Transformers