STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking

Jianbo Ma, Chuanming Tang, Fei Wu, Can Zhao, Jianlin Zhang, Zhiyong Xu
2024-09-17
Abstract: Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision. Current MOT trackers rely on accurate object detection results and precise matching of target re-identification (ReID). These methods focus on optimizing target spatial attributes while overlooking temporal cues in modelling object relationships, especially under challenging tracking conditions such as object deformation and blurring. To address these issues, we propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT), which utilizes historical embedding features to model the representation of ReID and detection features in a sequential order. Concretely, a temporal embedding boosting module is introduced to enhance the discriminability of individual embeddings based on adjacent frame cooperation. The trajectory embedding is then propagated by a temporal detection refinement module to mine salient target locations in the temporal field. Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate that STCMOT sets a new state of the art in the MOTA and IDF1 metrics. The source code is released at https://github.com/ydhcg-BoBo/STCMOT.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to address?

This paper targets several key challenges in multiple object tracking (MOT) on unmanned aerial vehicle (UAV) videos. The authors point out that current MOT methods rely mainly on accurate object detection results and precise target re-identification (ReID) matching, but often overlook temporal cues when modeling object relationships, and so perform poorly under difficult tracking conditions such as object deformation and blurring. To address this, the authors propose a Spatio-Temporal Cohesion Multiple Object Tracking (STCMOT) framework, which uses historical embedding features to model the sequential representation of ReID and detection features, with the aim of improving tracking robustness and accuracy.

### Main problems and solutions

1. **Optimization of spatial attributes while ignoring temporal cues**
   - Current methods optimize the spatial attributes of targets but neglect the temporal cues needed to model object relationships.
   - **Solution**: A temporal embedding boosting module (TEBM) is introduced. It generates channel-level descriptors by combining the ReID feature maps of adjacent frames to highlight the distinctiveness of individual embeddings.
2. **Performance degradation under complex tracking conditions**
   - Under conditions such as object deformation and blurring, the performance of existing methods degrades easily.
   - **Solution**: A temporal detection refinement module (TDRM) is designed. It improves detection performance by propagating trajectory embeddings and mining salient object positions in the temporal domain.
3. **Resource consumption and efficiency**
   - Traditional two-stage tracking frameworks use separate networks for object detection and embedding extraction, which leads to high storage costs and heavy resource consumption.
   - **Solution**: STCMOT adopts a one-shot tracking framework that integrates the detection branch and the ReID branch into a single network, balancing tracking performance and speed.

### Experimental results

The experiments show that STCMOT achieves new state-of-the-art performance on the VisDrone2019 and UAVDT datasets in both the MOTA and IDF1 metrics. This supports the effectiveness of STCMOT for multiple object tracking in UAV videos.

### Summary

By introducing spatio-temporal cues and enhancing feature representation, this paper addresses the weak performance of existing MOT methods in complex scenarios and offers a more efficient and accurate approach to multiple object tracking in UAV videos.
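
For readers unfamiliar with the two metrics cited in the experimental results, the sketch below computes MOTA and IDF1 from their standard definitions in the MOT literature: MOTA = 1 - (FN + FP + IDSW) / GT and IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN). The function names and the example counts are illustrative assumptions and are not taken from the paper or its code release; the matching step that produces these counts is omitted.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy.

    All counts are accumulated over every frame of the sequence;
    num_gt is the total number of ground-truth boxes.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """Identity F1 score from identity-level TP, FP, and FN counts."""
    denom = 2 * idtp + idfp + idfn
    # Convention assumed here: return 0.0 when there is nothing to score.
    return 2 * idtp / denom if denom else 0.0


# Made-up counts for illustration only; these are not results from the paper.
print(f"MOTA: {mota(200, 100, 20, 1000):.3f}")
print(f"IDF1: {idf1(700, 150, 300):.3f}")
```

MOTA penalizes detection errors and identity switches together, so it is dominated by detection quality, while IDF1 measures how consistently each target keeps the same identity over time. That is why the two are usually reported side by side.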