Skeleton-Guided Spatial-Temporal Feature Learning for Video-Based Visible-Infrared Person Re-Identification

Wenjia Jiang, Xiaoke Zhu, Jiakang Gao, Di Liao
2024-11-17
Abstract: Video-based visible-infrared person re-identification (VVI-ReID) is challenging due to significant modality feature discrepancies. Spatial-temporal information in videos is crucial, but its accuracy is often degraded by issues such as low video quality and occlusions. Existing methods mainly focus on reducing modality differences, but pay limited attention to improving spatial-temporal features, particularly for infrared videos. To address this, we propose a novel Skeleton-guided spatial-Temporal feAture leaRning (STAR) method for VVI-ReID. By using skeleton information, which is robust to issues such as poor image quality and occlusions, STAR improves the accuracy of spatial-temporal features in videos of both modalities. Specifically, STAR employs two levels of skeleton-guided strategies: frame level and sequence level. At the frame level, the robust structured skeleton information is used to refine the visual features of individual frames. At the sequence level, we design a feature aggregation mechanism based on a skeleton key-point graph, which learns the contribution of different body parts to spatial-temporal features, further enhancing the accuracy of global features. Experiments on benchmark datasets demonstrate that STAR outperforms state-of-the-art methods. Code will be open-sourced soon.
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper addresses two major challenges in video-based visible-infrared person re-identification (VVI-ReID):

1. **Modality Discrepancies**: Visible-light and infrared images differ significantly in their features, which makes direct matching difficult. Contributing factors include lighting conditions and background noise.
2. **Spatial-Temporal Information Extraction**: Spatial-temporal information in videos is crucial for accurately identifying pedestrians, but low video quality and occlusions can seriously degrade its accuracy.

Most existing methods focus on reducing modality differences and pay limited attention to improving spatial-temporal features, especially for infrared videos.

To address these problems, the authors propose Skeleton-guided spatial-Temporal feAture leaRning (STAR), a method that uses skeleton information to improve the accuracy of spatial-temporal features. STAR applies skeleton guidance at two levels:

- **Frame-level Skeleton Guidance**: Structured skeleton information is used to refine the visual features of each individual frame, so that they remain accurate even when the frame is low quality or partially occluded.
- **Sequence-level Skeleton Guidance**: A feature aggregation mechanism based on a skeleton key-point graph learns how much each body part contributes to the spatial-temporal features, further improving the accuracy of the global features.

With these improvements, STAR outperforms existing methods on multiple benchmark datasets for cross-modal person re-identification.
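The two levels of skeleton guidance described above can be illustrated with a small NumPy sketch. This is not the authors' implementation, since the summary does not specify the architecture. It assumes that key points are rendered as Gaussian heatmaps, that frame-level refinement is a residual spatial reweighting, and that sequence-level aggregation is a single graph-propagation step followed by softmax part weights. All function names, shapes, and the toy skeleton are hypothetical, and random matrices stand in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def keypoint_heatmaps(keypoints, height, width, sigma=1.5):
    """Render (K, H, W) Gaussian heatmaps from (K, 2) key points given as (x, y)."""
    ys, xs = np.mgrid[0:height, 0:width]
    dx = xs[None] - keypoints[:, 0, None, None]
    dy = ys[None] - keypoints[:, 1, None, None]
    return np.exp(-(dx**2 + dy**2) / (2 * sigma**2))

def refine_frame(feature_map, heatmaps):
    """Frame level (assumed form): boost a (C, H, W) feature map near the skeleton."""
    mask = heatmaps.max(axis=0)          # (H, W), values in (0, 1]
    return feature_map * (1.0 + mask)    # residual form keeps the original signal

def part_features(feature_map, heatmaps):
    """Pool one C-dim vector per key point with its normalized heatmap -> (K, C)."""
    weights = heatmaps / heatmaps.sum(axis=(1, 2), keepdims=True)
    return np.einsum("khw,chw->kc", weights, feature_map)

def aggregate_sequence(parts, adjacency, w_graph, w_score):
    """Sequence level (assumed form): graph propagation, then weighted part pooling.

    parts: (T, K, C), adjacency: (K, K), w_graph: (C, C), w_score: (C,)
    """
    a = adjacency + np.eye(adjacency.shape[0])       # add self-loops
    degree = a.sum(axis=1)
    a_norm = a / np.sqrt(np.outer(degree, degree))   # symmetric normalization
    temporal_mean = parts.mean(axis=0)               # (K, C), average over frames
    h = np.maximum(a_norm @ temporal_mean @ w_graph, 0.0)  # (K, C), ReLU
    part_weights = softmax(h @ w_score)              # (K,), contribution per part
    return part_weights @ h, part_weights            # (C,) video-level feature

# Toy usage: 4 frames, 5 key points, 8 channels, 16x8 feature maps.
rng = np.random.default_rng(0)
T, K, C, H, W = 4, 5, 8, 16, 8
adjacency = np.zeros((K, K))
for i, j in [(0, 1), (1, 2), (1, 3), (1, 4)]:        # hypothetical skeleton edges
    adjacency[i, j] = adjacency[j, i] = 1.0

parts = []
for _ in range(T):
    fmap = rng.standard_normal((C, H, W))            # stand-in for backbone features
    kps = rng.uniform([0, 0], [W - 1, H - 1], size=(K, 2))
    hm = keypoint_heatmaps(kps, H, W)
    parts.append(part_features(refine_frame(fmap, hm), hm))

video_feature, part_weights = aggregate_sequence(
    np.stack(parts), adjacency,
    0.1 * rng.standard_normal((C, C)),               # stand-in for learned weights
    rng.standard_normal(C),
)
```

In this sketch `part_weights` sums to one, so it can be read as the relative contribution of each body part to the video-level feature. In a trained model those weights would come from learned parameters rather than random ones.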
### Summary

By introducing skeleton information, this paper addresses the problems of modality differences and spatial-temporal information extraction in video-based visible-infrared person re-identification, improving both the accuracy and the robustness of identification.