Abstract: Video-based person re-identification (Re-ID) aims to retrieve target pedestrians from video sequences captured by non-overlapping cameras. At present, mainstream approaches post-process the feature map extracted by a convolutional neural network backbone to obtain either a global representation or a fine-grained local representation for higher accuracy. However, they still face challenges: global-based methods suffer from information loss, and local-based methods suffer from spatio-temporal feature fragmentation. To alleviate these problems, this paper proposes a Spatio-Temporal Feature Enhancement (STFE) network that takes a comprehensive spatio-temporal perspective, combining the advantages of both families of methods to obtain more comprehensive information from video tracklets. STFE consists of two main modules: the Feature Space Projection Module (FSPM) and the Global Low-frequency Enhancement Module (GLEM). FSPM mathematically converts continuous video information into a discrete feature space and selectively retains the more useful information, thus avoiding spatio-temporal information loss. Meanwhile, FSPM operates on global features instead of dividing feature maps spatially, thereby avoiding spatio-temporal feature fragmentation. In addition, GLEM, which is based on a transformer, acts as a broadband low-pass filter to mine richer global information. Finally, by combining FSPM with GLEM, STFE obtains a comprehensive spatio-temporal video representation. Extensive experiments were conducted on two widely used video Re-ID datasets. The experimental results verify our idea and demonstrate the effectiveness of the proposed STFE, which achieves 95.5% Rank-1 accuracy on the MARS benchmark and surpasses previous state-of-the-art methods by a large margin of +4%.

STFE: A Comprehensive Video-Based Person Re-Identification Network Based on Spatio-Temporal Feature Enhancement