Transformer Network for Multi-Person Tracking and Re-Identification in Unconstrained Environment

Hamza Mukhtar, Muhammad Usman Ghani Khan
2023-12-19
Abstract: Multi-object tracking (MOT) has profound applications in a variety of fields, including surveillance, sports analytics, self-driving, and cooperative robotics. Despite considerable advancements, existing MOT methodologies tend to falter when faced with non-uniform movements, occlusions, and appearance-reappearance scenarios of the objects. Recognizing this inadequacy, we put forward an integrated MOT method that not only marries object detection and identity linkage within a singular, end-to-end trainable framework but also equips the model with the ability to maintain object identity links over long periods of time. Our proposed model, named STMMOT, is built around four key modules: 1) candidate proposal generation, which generates object proposals via a vision-transformer encoder-decoder architecture that detects the object from each frame in the video; 2) scale variant pyramid, a progressive pyramid structure to learn the self-scale and cross-scale similarities in multi-scale feature maps; 3) spatio-temporal memory encoder, extracting the essential information from the memory associated with each object under tracking; and 4) spatio-temporal memory decoder, simultaneously resolving the tasks of object detection and identity association for MOT. Our system leverages a robust spatio-temporal memory module that retains extensive historical observations and effectively encodes them using an attention-based aggregator. The uniqueness of STMMOT lies in representing objects as dynamic query embeddings that are updated continuously, which enables the prediction of object states with attention mechanisms and eradicates the need for post-processing.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses several key issues in multi-object tracking (MOT), particularly multi-person tracking and re-identification in unconstrained environments. Specifically, it focuses on the following aspects:

1. **Non-uniform motion, occlusion, and object reappearance**: Existing MOT methods perform poorly in complex scenarios involving non-uniform motion, occlusion, and object reappearance. These factors reduce tracking accuracy and can cause targets to be lost.
2. **Separation of detection and identity association**: Traditional MOT methods usually split object detection and identity association into two independent stages, which leads to inefficiency and performance loss.
3. **Long-term identity maintenance**: Maintaining an object's identity link over long tracking periods is difficult. Existing methods often suffer from identity-switch errors during long-term tracking.

To address these issues, the paper proposes a new integrated MOT method, STMMOT (SpatioTemporal Multi-Object Tracking). The method is built from the following four key Transformer-based modules:

1. **Candidate proposal generation**: Uses a vision Transformer encoder-decoder architecture to detect objects in each frame of the video.
2. **Scale variant pyramid**: A progressive pyramid structure that learns self-scale and cross-scale similarities in multi-scale feature maps.
3. **Spatio-temporal memory encoder**: Extracts key information from the memory associated with each tracked object.
4. **Spatio-temporal memory decoder**: Addresses object detection and identity association simultaneously.

The uniqueness of STMMOT lies in representing objects as dynamic query embeddings that are continuously updated through the attention mechanism, which allows object states to be predicted directly and eliminates the need for post-processing.
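To make the idea of a per-object memory read by an attention-based aggregator more concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the names `TrackMemory`, `aggregate_memory`, and `update_query`, the fixed-length buffer, the single attention head, and the residual update are all assumptions chosen for illustration, and the learned projection matrices a real Transformer layer would use are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate_memory(query, memory):
    """Single-head scaled dot-product attention (no learned projections).

    query:  (d,)   current embedding of one tracked object
    memory: (T, d) past embeddings stored for that object
    Returns the attention-weighted summary (d,) and the weights (T,).
    """
    d = query.shape[-1]
    scores = memory @ query / np.sqrt(d)  # similarity to each past state
    weights = softmax(scores)             # non-negative, sums to 1
    return weights @ memory, weights


class TrackMemory:
    """Hypothetical fixed-length history buffer for one tracked object."""

    def __init__(self, dim, max_len=8):
        self.dim = dim
        self.max_len = max_len
        self.buffer = []

    def push(self, embedding):
        """Store the newest embedding, dropping the oldest when full."""
        self.buffer.append(np.asarray(embedding, dtype=float))
        if len(self.buffer) > self.max_len:
            self.buffer.pop(0)

    def update_query(self, query):
        """Refresh the object's query with a residual memory read."""
        memory = np.stack(self.buffer)                   # (T, d)
        context, weights = aggregate_memory(query, memory)
        return query + context, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    track = TrackMemory(dim=4, max_len=3)
    for _ in range(5):                    # five frames of observations
        track.push(rng.normal(size=4))
    new_query, attn = track.update_query(rng.normal(size=4))
    print(new_query.shape, attn.shape)
```

In this sketch the query embedding plays the role of the continuously updated object representation: each frame, it is refreshed from the object's own history, so an object that reappears after an occlusion can still be matched against what the memory retained. How STMMOT actually sizes, encodes, and decodes its memory is described in the paper itself.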
Experimental results show that STMMOT achieves significant performance improvements on the MOT17 and MOT20 datasets, especially in metrics such as IDF1, MOTA, HOTA, and AssA, demonstrating a clear advantage over the previous best method, TransMOT.