Abstract: Video-based visible-infrared person re-identification (VVI-ReID) is challenging due to significant modality feature discrepancies. Spatial-temporal information in videos is crucial, but its accuracy is often degraded by issues such as low video quality and occlusions. Existing methods mainly focus on reducing modality differences and pay limited attention to improving spatial-temporal features, particularly for infrared videos. To address this, we propose a novel Skeleton-guided spatial-Temporal feAture leaRning (STAR) method for VVI-ReID. By using skeleton information, which is robust to issues such as poor image quality and occlusions, STAR improves the accuracy of spatial-temporal features in videos of both modalities. Specifically, STAR employs two levels of skeleton-guided strategies: frame level and sequence level. At the frame level, the robust structured skeleton information is used to refine the visual features of individual frames. At the sequence level, we design a feature aggregation mechanism based on a skeleton key-point graph, which learns the contribution of different body parts to spatial-temporal features, further enhancing the accuracy of global features. Experiments on benchmark datasets demonstrate that STAR outperforms state-of-the-art methods. Code will be open-sourced soon.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to solve two major challenges in video-based visible-infrared person re-identification (VVI-ReID):

1. **Modality Discrepancies**:
   - There are significant feature differences between visible-light and infrared images, which make direct matching difficult. These differences include lighting conditions, background noise, etc.
2. **Spatial-Temporal Information Extraction**:
   - The spatial-temporal information in videos is crucial for accurately identifying pedestrians, but problems such as low-quality videos and occlusions can seriously degrade its accuracy.

Most existing methods focus on reducing modality differences but pay limited attention to improving spatial-temporal features, especially for infrared videos. To solve these problems, the authors propose a skeleton-guided spatial-temporal feature learning method (Skeleton-guided spatial-Temporal feAture leaRning, STAR). STAR uses skeleton information to enhance the visual features of video frames and improves feature extraction through two strategies:

- **Frame-level Skeleton Guidance**: Structured skeleton information is used to refine the visual features of each individual frame, so that they remain accurate even when the frame is low-quality or partially occluded.
- **Sequence-level Skeleton Guidance**: A feature aggregation mechanism based on a skeleton key-point graph learns the contributions of different body parts to the spatial-temporal features, further improving the accuracy of the global features.

With these improvements, experiments on benchmark datasets show that STAR outperforms existing methods on the cross-modal person re-identification task.
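The two skeleton-guided strategies above can be illustrated with a small sketch. This is not the authors' implementation, which the summary does not describe in detail. It assumes that frame-level guidance can be approximated by scaling each body-part feature with a key-point confidence, and that sequence-level guidance can be approximated by one step of neighbour averaging over a skeleton graph followed by softmax-weighted pooling. All function names, the three-part skeleton, and the fixed `part_scores` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def refine_frame(part_features, keypoint_confidence):
    """Frame level (assumed form): scale each body-part feature by the
    confidence of its key point, so occluded parts contribute less."""
    return [[c * x for x in feat]
            for feat, c in zip(part_features, keypoint_confidence)]

def aggregate_sequence(frames, adjacency, part_scores):
    """Sequence level (assumed form): average each part over time, mix it
    with its skeleton-graph neighbours, then pool parts with softmax weights."""
    num_parts = len(adjacency)
    dim = len(frames[0][0])
    # Temporal mean of every body part across the tracklet.
    temporal = [[sum(f[p][d] for f in frames) / len(frames) for d in range(dim)]
                for p in range(num_parts)]
    # One step of mean message passing over the key-point graph (self included).
    mixed = []
    for p in range(num_parts):
        neigh = [q for q in range(num_parts) if adjacency[p][q]] + [p]
        mixed.append([sum(temporal[q][d] for q in neigh) / len(neigh)
                      for d in range(dim)])
    # Part weights would be learned in a real model; here they are given.
    weights = softmax(part_scores)
    return [sum(w * mixed[p][d] for p, w in enumerate(weights))
            for d in range(dim)]

# Toy tracklet: 2 frames, 3 parts (head, torso, legs), 2-D features.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # head-torso, torso-legs
frames_raw = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],  # legs corrupted by an occlusion
]
confidences = [[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]]  # occluded legs get 0
frames = [refine_frame(f, c) for f, c in zip(frames_raw, confidences)]
global_feature = aggregate_sequence(frames, adjacency, part_scores=[0.0, 0.0, 0.0])
```

In this toy example the corrupted leg feature in the second frame is zeroed before aggregation, so it cannot dominate the pooled tracklet feature. A trained model would learn the part weights and would likely use a richer graph operator than plain neighbour averaging.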
### Summary

By introducing skeleton information, this paper addresses the problems of modality differences and spatial-temporal information extraction in video-based visible-infrared person re-identification, improving the accuracy and robustness of identification.
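To make the identification task concrete, the sketch below treats cross-modal re-identification as retrieval: a tracklet feature from one modality is compared against a gallery of tracklet features from the other, and the closest gallery entry is taken as the predicted identity. The summary does not state the evaluation protocol, so cosine similarity and rank-1 accuracy are assumptions here (both are common in re-identification work), and the identities and feature values are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_gallery(query, gallery):
    """Return gallery identities ordered from most to least similar."""
    scored = sorted(gallery, key=lambda item: cosine(query, item[1]), reverse=True)
    return [identity for identity, _ in scored]

def rank1_accuracy(queries, gallery):
    """Fraction of queries whose top-ranked gallery entry has the right identity."""
    hits = sum(1 for identity, feat in queries
               if rank_gallery(feat, gallery)[0] == identity)
    return hits / len(queries)

# Hypothetical tracklet features: infrared queries against a visible gallery.
gallery = [("id_a", [1.0, 0.1]), ("id_b", [0.1, 1.0])]
queries = [("id_a", [0.9, 0.2]), ("id_b", [0.2, 0.8])]
accuracy = rank1_accuracy(queries, gallery)
```

Better spatial-temporal features matter in this setting because they directly determine how well same-identity tracklets from the two modalities line up in the shared feature space.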

Skeleton-Guided Spatial-Temporal Feature Learning for Video-Based Visible-Infrared Person Re-Identification
