Abstract: In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they remain limited in extracting local attributes and global identity information, both of which are critical for person re-identification. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules to address this limitation. Specifically, MSTAT consists of three stages that encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving a holistic perception of the input person. We combine the outputs of all stages for the final identification. In practice, to reduce computational cost, Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct self-attention along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract informative and discriminative feature representations at different stages. All of these modules are realized with newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting informative and discriminative information from videos, and show that MSTAT achieves state-of-the-art accuracy on various standard benchmarks.
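
To make two ideas from the abstract concrete, the following is a minimal sketch rather than the authors' implementation. The first function counts query-key pairs to illustrate why running self-attention separately along the spatial and temporal dimensions is cheaper than joint attention over all tokens. The second shows one plausible reading of temporal patch shuffling, in which patches at a few randomly chosen spatial positions are permuted across frames. The abstract does not specify the procedure, so the function names, the `num_positions` parameter, and the clip layout `clip[t][p]` are all assumptions made for illustration.

```python
import random


def attention_pair_counts(num_frames, patches_per_frame):
    """Count query-key pairs for joint vs. separate spatial/temporal attention."""
    tokens = num_frames * patches_per_frame
    joint = tokens * tokens
    # Spatial attention: every frame attends only among its own patches.
    spatial = num_frames * patches_per_frame ** 2
    # Temporal attention: every patch position attends only across frames.
    temporal = patches_per_frame * num_frames ** 2
    return joint, spatial + temporal


def temporal_patch_shuffle(clip, num_positions, rng):
    """Permute patches across frames at randomly chosen patch positions.

    clip[t][p] is patch p of frame t (assumed layout). The input is not
    modified; a shuffled copy is returned.
    """
    num_frames = len(clip)
    num_patches = len(clip[0])
    shuffled = [list(frame) for frame in clip]
    for p in rng.sample(range(num_patches), num_positions):
        order = list(range(num_frames))
        rng.shuffle(order)
        for t, src in enumerate(order):
            # The patch keeps its spatial position p but moves to another frame.
            shuffled[t][p] = clip[src][p]
    return shuffled
```

For a hypothetical clip of 8 frames with 128 patches each, `attention_pair_counts(8, 128)` returns `(1048576, 139264)`, so the separate scheme evaluates roughly 7.5 times fewer pairs in this setting. The shuffle never moves a patch to a different spatial position, which keeps the spatial layout of each frame intact while perturbing the temporal order.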
