Abstract: Video-based person re-identification (Re-ID) has made significant progress in recent years. Its key challenge is constructing discriminative and robust person feature representations. Methods based on local regions use spatial and temporal attention to extract representative local features, but they often overlook the correlations between those regions. To exploit the relationships among different local regions, we propose a novel representation learning approach for video person Re-ID based on a graph transformer, which enables contextual interactions between relevant region features. Specifically, we construct a local relation graph that models the intrinsic relationships between nodes representing local regions. Feature propagation over this graph uses a transformer architecture: region features are refined iteratively using information from adjacent nodes, yielding part-level feature representations. To learn compact and discriminative representations, we further propose a global feature learning branch based on a vision transformer that captures the relationships between different frames in a sequence. In addition, we design a dual-branch interaction network based on multi-head fusion attention to integrate the frame-level features extracted by the local and global branches. After this interaction, the global and local features are concatenated and used for testing. We evaluate the proposed method on three datasets: iLIDS-VID, MARS, and DukeMTMC-VideoReID. Experimental results show competitive performance, validating the effectiveness of the approach.
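
The local branch can be pictured as self-attention restricted to a graph over region features. The following is only a minimal NumPy sketch under stated assumptions, not the authors' implementation: the abstract does not say how the local relation graph is built, so the k-nearest-neighbour cosine adjacency is an assumption, and the function names (`knn_adjacency`, `graph_attention_layer`), the single attention head, and the omission of layer normalization and feed-forward sublayers are all illustrative choices.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries of -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def knn_adjacency(feats, k):
    """Assumed graph construction: link each region node to its k most
    cosine-similar nodes, symmetrize, and add self-loops."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)            # exclude self when ranking
    n = feats.shape[0]
    nearest = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros((n, n), dtype=bool)
    adj[np.arange(n)[:, None], nearest] = True
    adj |= adj.T                              # undirected edges
    adj |= np.eye(n, dtype=bool)              # self-loops
    return adj


def graph_attention_layer(feats, adj, w_q, w_k, w_v):
    """One propagation step: each node attends only to adjacent nodes,
    and the aggregated message is added back as a residual."""
    q, k, v = feats @ w_q, feats @ w_k, feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores = np.where(adj, scores, -np.inf)   # mask non-neighbours
    attn = softmax(scores, axis=1)
    return feats + attn @ v, attn


# Toy input: 6 region nodes with 8-dimensional features.
rng = np.random.default_rng(0)
regions = rng.normal(size=(6, 8))
adjacency = knn_adjacency(regions, k=2)
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(8, 8)) for _ in range(3))

refined = regions
for _ in range(2):                            # iterative refinement
    refined, weights = graph_attention_layer(refined, adjacency, w_q, w_k, w_v)
```

Masking the attention scores with the adjacency matrix is what separates this from plain self-attention: information flows only along graph edges, so repeating the layer widens each node's context one hop at a time.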

Spatial-temporal Graph-Guided Global Attention Network for Video-Based Person Re-Identification

Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos

MSTN: A Multi-granular Spatial–Temporal Network for video-based person re-identification

Spatial-Temporal Attention-aware Learning for Video-based Person Re-identification

Learning Recurrent 3D Attention for Video-Based Person Re-Identification

Adaptive Graph Representation Learning for Video Person Re-identification

Pixel-wise Graph Attention Networks for Person Re-identification

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

A Multi-Scale Spatial-Temporal Attention Model for Person Re-Identification in Videos

ASTA-Net: Adaptive Spatio-Temporal Attention Network for Person Re-Identification in Videos

Video-based Person Re-Identification Via Spatio-Temporal Attentional and Two-Stream Fusion Convolutional Networks

Multi-layer Attention for Person Re-Identification

Pose-Aided Video-based Person Re-Identification via Recurrent Graph Convolutional Network

Temporal-Contextual Attention Network for Video-Based Person Re-identification

Global-Local Temporal Representations For Video Person Re-Identification

Learning Multi-Attention Context Graph for Group-Based Re-Identification

AA-RGTCN: Reciprocal Global Temporal Convolution Network with Adaptive Alignment for Video-Based Person Re-Identification

Relation-Guided Spatial Attention and Temporal Refinement for Video-Based Person Re-Identification

Learning global and local features using graph neural networks for person re-identification

Parallel Attention with Weighted Efficient Network for Video-Based Person Re-Identification

Video-based person re-identification with complementary local and global features using a graph transformer