Abstract: Video-based person re-identification (re-ID) has attracted attention because spatiotemporal cues can provide a comprehensive, multi-view representation of a pedestrian, and recent work has studied in depth how to exploit these cues effectively. However, while the discriminability and correlation of spatiotemporal features are often studied, the complex relationships between these features have been relatively neglected. In particular, when dealing with multi-granularity features, it remains a challenge to describe the different spatial representations of the same person under different viewpoints. To address this challenge, this paper proposes a multi-granularity inter-frame relationship exploration and global residual embedding network. The method extracts more comprehensive and discriminative feature representations by exploring the interactions and global differences between multi-granularity features. Specifically, it models the dynamic relationships among features of different granularities across long video sequences and uses a structured perceptual adjacency matrix to aggregate spatiotemporal information, so that cross-granularity information is integrated into each individual feature. In addition, a residual learning mechanism guides the global features toward greater diversity and reduces the negative impact of factors such as occlusion. Experiments on three mainstream benchmark datasets verify the effectiveness of the method, which significantly surpasses state-of-the-art solutions. These results indicate that the proposed method addresses the problem of accurately identifying and exploiting the complex relationships between multi-granularity spatiotemporal features in video-based person re-ID.
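The two ideas named in the abstract (relation modeling over multi-granularity features via an adjacency matrix, and a residual view of global features) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the stripe-based pooling, the cosine-similarity softmax adjacency, the single propagation step, and the "frame minus temporal mean" residual are all assumptions chosen for illustration, and every function name below is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def split_granularities(frame_feats, scales=(1, 2, 4)):
    """Pool per-frame stripe features at several granularities.

    frame_feats: (T, H, C) array, T frames, H horizontal stripes, C channels.
    Returns (T, N, C) with N = sum(scales). The scale set is an assumption.
    """
    parts = []
    for s in scales:
        # Split the H stripes into s groups and average-pool each group.
        for chunk in np.array_split(frame_feats, s, axis=1):
            parts.append(chunk.mean(axis=1))  # (T, C)
    return np.stack(parts, axis=1)


def relation_aggregate(nodes):
    """One step of graph propagation over all part nodes of all frames.

    nodes: (M, C). The adjacency here is a row-normalized cosine-similarity
    matrix (an assumed stand-in for the paper's adjacency matrix).
    """
    normed = nodes / (np.linalg.norm(nodes, axis=1, keepdims=True) + 1e-12)
    adjacency = softmax(normed @ normed.T, axis=1)  # (M, M), rows sum to 1
    # Skip connection keeps each node's own feature alongside its neighbors'.
    return nodes + adjacency @ nodes, adjacency


def global_residual(global_feats):
    """Split per-frame global features into a temporal mean and residuals.

    global_feats: (T, C). Returns the (C,) mean and the (T, C) residuals.
    """
    mean = global_feats.mean(axis=0, keepdims=True)
    return mean.squeeze(0), global_feats - mean


rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 4, 16))      # 8 frames, 4 stripes, 16 channels
parts = split_granularities(feats)           # (8, 7, 16)
refined, adj = relation_aggregate(parts.reshape(-1, 16))
video_global, residuals = global_residual(parts[:, 0, :])
```

In this sketch the coarsest granularity (index 0) plays the role of the global feature, and the residuals capture how each frame deviates from the sequence-level average, for example when a frame is partially occluded.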

Multi-scale Representation with Graph Learning for Video-Based Person Re-Identification

MSTN: A Multi-granular Spatial–Temporal Network for video-based person re-identification

Multi-scale Spatial-temporal Network for Person Re-identification

Spatial-Temporal Graph Convolutional Network for Video-Based Person Re-Identification

Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos

Adaptive Graph Representation Learning for Video Person Re-identification

Multi-Scale Aligned Spatial-Temporal Interaction for Video-Based Person Re-Identification

Combine Coarse and Fine Cues: Multi-grained Fusion Network for Video-Based Person Re-identification

Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification

Multi-Scale 3D Convolution Network for Video Based Person Re-Identification

Multi-Scale Relation Network for Person Re-identification

Multi-Scale Temporal Cues Learning for Video Person Re-Identification

Pose-Aided Video-based Person Re-Identification via Recurrent Graph Convolutional Network

Multi-scale spatio-temporal feature adaptive aggregation for video-based Person Re-identification

Multi-Scale Cascading Network with Compact Feature Learning for RGB-Infrared Person Re-Identification

Deep Siamese Network with Multi-level Similarity Perception for Person Re-identification

Video-based person re-identification with complementary local and global features using a graph transformer

Keypoint Message Passing for Video-based Person Re-Identification

Multi-granular inter-frame relation exploration and global residual embedding for video-based person re-identification

Multi-Level Fusion Temporal-Spatial Co-Attention for Video-Based Person Re-Identification

Graph-Based Multi-granularity Person Re-identification