Abstract: The goal of video-based person re-identification (Re-ID) is to identify the same person across multiple non-overlapping cameras. The key to accomplishing this challenging task is to sufficiently exploit both spatial and temporal cues in video sequences. However, most current methods cannot accurately locate semantic regions or efficiently filter discriminative spatio-temporal features, which makes it difficult to handle issues such as spatial misalignment and occlusion. We therefore propose a novel feature aggregation framework, multi-task and multi-granularity aggregation with global-guided attention (MMA-GGA), which aims to adaptively generate more representative spatio-temporal aggregation features. Specifically, we develop a multi-task multi-granularity aggregation (MMA) module that extracts features at different locations and scales to identify key semantic-aware regions that are robust to spatial misalignment. Then, to determine the importance of the multi-granular semantic information, we propose a global-guided attention (GGA) mechanism that learns weights based on the global features of the video sequence, allowing our framework to identify stable local features while ignoring occlusions. As a result, the MMA-GGA framework can efficiently and effectively capture more robust and representative features. Extensive experiments on four benchmark datasets demonstrate that our MMA-GGA framework outperforms current state-of-the-art methods. In particular, our method achieves a rank-1 accuracy of 91.0% on the MARS dataset, the most widely used database, significantly outperforming existing methods.
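
The idea of weighting local features by their agreement with a sequence-level global feature can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how GGA computes its weights, so the scaled dot-product scoring, the softmax normalization, and the names `softmax` and `global_guided_aggregate` are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def global_guided_aggregate(local_feats, global_feat):
    """Aggregate local features using weights guided by a global feature.

    local_feats: array of shape (N, D), one row per local (part or frame) feature.
    global_feat: array of shape (D,), a sequence-level descriptor.

    Assumption (hypothetical scoring rule): each local feature is scored by
    its scaled dot product with the global feature, and the scores are
    normalized with a softmax. The paper's GGA may differ.
    """
    dim = local_feats.shape[1]
    scores = local_feats @ global_feat / np.sqrt(dim)  # shape (N,)
    weights = softmax(scores)                          # shape (N,), sums to 1
    aggregated = weights @ local_feats                 # shape (D,)
    return aggregated, weights


# Toy example: two local features agree with the global feature, while the
# third (standing in for an occluded region) points the opposite way.
local_feats = np.array([
    [2.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0],
    [-2.0, 0.0, 0.0, 0.0],
])
global_feat = np.array([1.0, 0.0, 0.0, 0.0])

aggregated, weights = global_guided_aggregate(local_feats, global_feat)
```

In this toy case the local feature that disagrees with the global descriptor receives the smallest weight, which mirrors the intuition of down-weighting occluded regions while keeping stable ones.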

A Multi-Scale Spatial-Temporal Attention Model for Person Re-Identification in Videos

MSTN: A Multi-granular Spatial–Temporal Network for video-based person re-identification

Spatial-Temporal Attention-aware Learning for Video-based Person Re-identification

Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos

Temporal-Contextual Attention Network for Video-Based Person Re-identification

ASTA-Net: Adaptive Spatio-Temporal Attention Network for Person Re-Identification in Videos

Learning Recurrent 3D Attention for Video-Based Person Re-Identification

Dual Attention Matching Network for Context-Aware Feature Sequence based Person Re-Identification

Concentrated Multi-Grained Multi-Attention Network for Video Based Person Re-Identification

Multi-Level Fusion Temporal-Spatial Co-Attention for Video-Based Person Re-Identification

Multi-scale spatio-temporal feature adaptive aggregation for video-based Person Re-identification

Parallel Attention with Weighted Efficient Network for Video-Based Person Re-Identification

Multi-Scale Temporal Cues Learning for Video Person Re-Identification

DMM: Dual-Modal Model for Person Re-Identification

Multi-scale local-global architecture for person re-identification

Leader-Based Multi-Scale Attention Deep Architecture for Person Re-Identification

Video-Based Person Re-Identification Using Spatial-Temporal Memory Coupling Network

Multi-Scale Body-Part Mask Guided Attention for Person Re-identification

MvHAAN: multi-view hierarchical attention adversarial network for person re-identification

Multitask Multigranularity Aggregation With Global-Guided Attention for Video Person Re-Identification

Relation-Guided Spatial Attention and Temporal Refinement for Video-Based Person Re-Identification