Cross-Modality Spatial-Temporal Transformer for Video-Based Visible-Infrared Person Re-Identification
Yujian Feng, Feng Chen, Jian Yu, Yimu Ji, Fei Wu, Tianliang Liu, Shangdong Liu, Xiao-Yuan Jing, Jiebo Luo
DOI: https://doi.org/10.1109/tmm.2024.3354575
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: Video-based visible-infrared person re-identification (VVI-ReID) aims to match the identity of a person captured in video sequences from both visible and infrared cameras. The VVI-ReID task requires considering both the spatial relationship between body parts within each frame and the temporal change of appearance between successive frames. Existing VVI-ReID methods employ Convolutional Neural Networks to extract local spatial features and Long Short-Term Memory networks to form temporal associations. However, these methods cannot effectively capture global spatial features or long-range temporal dependencies in ultra-long sequences. In this paper, we propose a Cross-modality Spatial-temporal Transformer (CST), comprising a Cross-frame Tube Transformer Module (CTTM) and a Multi-frame Transformer Fusion Module (MTFM), to address these challenges. First, CTTM tokenizes a video clip into multiple 3D tubes, each encapsulating local spatial-temporal information of pedestrians, and then obtains global spatial-temporal representations by establishing relationships between tubes. Second, we design MTFM to exchange information between multiple frames using message tokens, thus modeling the long-range temporal dependencies of pedestrian features. In addition, to prevent the potential representation collapse caused by triplet-based loss functions, we propose a diversity-consistency (DC) loss function that preserves the diversity and consistency of cross-modality feature representations by imposing variance, invariance, and covariance constraints on the feature representations. Extensive benchmark experiments demonstrate that our approach outperforms state-of-the-art methods by large margins.
computer science, information systems, telecommunications, software engineering
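
Illustrative sketch (not the authors' implementation): the abstract describes tokenizing a clip into 3D tubes and a loss with variance, invariance, and covariance constraints, but gives no formulas. The NumPy code below shows one plausible reading of each idea. The function names `tokenize_tubes` and `dc_terms`, the tube size (2, 4, 4), the hinge target of 1.0, and the VICReg-style form of the three terms are all assumptions made for illustration; the paper's actual definitions may differ.

```python
import numpy as np


def tokenize_tubes(clip, tube=(2, 4, 4)):
    """Split a clip of shape (T, H, W, C) into non-overlapping 3D tubes.

    Returns an array of shape (num_tubes, t*h*w*C): one flattened token
    per tube. The tube size is an assumed value, not taken from the paper.
    """
    T, H, W, C = clip.shape
    t, h, w = tube
    if T % t or H % h or W % w:
        raise ValueError("clip dimensions must be divisible by the tube size")
    # Separate each axis into (number of tubes, size within a tube).
    x = clip.reshape(T // t, t, H // h, h, W // w, w, C)
    # Bring the three tube-index axes to the front.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, t * h * w * C)


def dc_terms(z_vis, z_ir, eps=1e-4):
    """Return (invariance, variance, covariance) terms for paired embeddings.

    z_vis, z_ir: arrays of shape (N, D) with N > 1, where row i of each
    array is assumed to belong to the same identity. This follows a
    VICReg-style formulation as a stand-in for the paper's DC loss.
    """
    # Invariance: paired cross-modality embeddings should agree.
    invariance = np.mean((z_vis - z_ir) ** 2)

    def variance_term(z):
        # Hinge on the per-dimension standard deviation; penalizes collapse.
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def covariance_term(z):
        # Penalize off-diagonal covariance so dimensions stay decorrelated.
        n, d = z.shape
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = variance_term(z_vis) + variance_term(z_ir)
    covariance = covariance_term(z_vis) + covariance_term(z_ir)
    return invariance, variance, covariance
```

With these assumed sizes, a clip of 4 frames at 8x8 resolution with 3 channels yields 8 tube tokens of length 96. For the loss terms, feeding fully collapsed embeddings (every row identical) gives zero invariance and covariance terms but a large variance term, which is the collapse case that the abstract says the DC loss is meant to discourage.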