Abstract: Video-based person re-identification (ReID) aims to exploit relevant features from spatial and temporal knowledge. Widely used methods include part-based and attention-based approaches that suppress irrelevant spatial–temporal features. However, it remains challenging to overcome inconsistencies across video frames caused by occlusion and imperfect detection. These mismatches make temporal processing ineffective and create an imbalance of crucial spatial information. To address these problems, we propose the Spatiotemporal Multi-Granularity Aggregation (ST-MGA) method, which is specifically designed to accumulate relevant features with spatiotemporally consistent cues. The proposed framework consists of three main stages: extraction, which extracts spatiotemporally consistent partial information; augmentation, which augments the partial information with different granularity levels; and aggregation, which effectively aggregates the augmented spatiotemporal information. We first introduce the consistent part-attention (CPA) module, which extracts spatiotemporally consistent and well-aligned attentive parts. Sub-parts derived from CPA provide temporally consistent semantic information, which resolves misalignment problems in videos caused by occlusion or inaccurate detection and maximizes the efficiency of aggregation through uniform partial information. To enhance the diversity of spatial and temporal cues, we introduce the Multi-Attention Part Augmentation (MA-PA) block, which incorporates fine parts at various granular levels, and the Long-/Short-term Temporal Augmentation (LS-TA) block, which is designed to capture both long- and short-term temporal relations. Using densely separated part cues, ST-MGA fully exploits and aggregates the spatiotemporal multi-granular patterns by comparing relations between parts and scales.
In the experiments, the proposed ST-MGA achieves state-of-the-art performance on several video-based ReID benchmarks, namely MARS, DukeMTMC-VideoReID, and LS-VID.
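
To make the augmentation and aggregation stages more concrete, the following is a minimal NumPy sketch of the general idea only. It is not the authors' implementation. It assumes that per-frame part features of shape (T, P, D) are already available from some extraction stage, and it replaces the attention-based CPA, MA-PA, and LS-TA modules with plain averaging and max pooling. The function names `part_granularities`, `temporal_granularities`, and `aggregate`, as well as the `group_sizes` and `window` parameters, are hypothetical and chosen for illustration.

```python
import numpy as np


def part_granularities(parts, group_sizes=(1, 2, 4)):
    """Merge adjacent parts into coarser groups by averaging.

    parts: array of shape (T, P, D) = (frames, parts, feature dim).
    Returns one array per group size g, each of shape (T, P // g, D).
    Uniform averaging is an assumption; the paper describes attention.
    """
    T, P, D = parts.shape
    levels = []
    for g in group_sizes:
        if P % g != 0:
            raise ValueError(f"P={P} is not divisible by group size {g}")
        # Split the part axis into (P // g) groups of g adjacent parts.
        levels.append(parts.reshape(T, P // g, g, D).mean(axis=2))
    return levels


def temporal_granularities(feats, window=2):
    """Return simple short-term and long-term temporal summaries.

    feats: array of shape (T, N, D).
    short: mean inside non-overlapping windows, then max over windows.
    long:  mean over all frames.
    Both results have shape (N, D).
    """
    T, N, D = feats.shape
    if T < window:
        raise ValueError("need at least `window` frames")
    usable = (T // window) * window  # drop trailing frames that do not fill a window
    windows = feats[:usable].reshape(T // window, window, N, D).mean(axis=1)
    short = windows.max(axis=0)
    long = feats.mean(axis=0)
    return short, long


def aggregate(parts, group_sizes=(1, 2, 4), window=2):
    """Concatenate short- and long-term summaries from every part granularity."""
    descriptors = []
    for level in part_granularities(parts, group_sizes):
        short, long = temporal_granularities(level, window)
        descriptors.append(short.ravel())
        descriptors.append(long.ravel())
    return np.concatenate(descriptors)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.normal(size=(4, 4, 8))  # 4 frames, 4 parts, 8-dim features
    print(aggregate(clip).shape)
```

With 4 parts and group sizes (1, 2, 4), the sketch produces 4 + 2 + 1 = 7 part groups, each contributing one short-term and one long-term vector. Simple concatenation is used here as a stand-in for the relation-based aggregation described in the abstract.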

MSTN: A Multi-granular Spatial–Temporal Network for video-based person re-identification

Video-Based Person Re-Identification Using Spatial-Temporal Memory Coupling Network

Contribution-Based Multi-Stream Feature Distance Fusion Method with $k$-Distribution Re-Ranking for Person Re-Identification

Joining Features by Global Guidance with Bi-Relevance Trihard Loss for Person Re-Identification

Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos

A Multi-Scale Spatial-Temporal Attention Model for Person Re-Identification in Videos

Multi-Granularity Aggregation with Spatiotemporal Consistency for Video-Based Person Re-Identification

AA-RGTCN: Reciprocal Global Temporal Convolution Network with Adaptive Alignment for Video-Based Person Re-Identification

Multi-scale spatio-temporal feature adaptive aggregation for video-based Person Re-identification

Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos

Saliency and Granularity: Discovering Temporal Coherence for Video-Based Person Re-Identification

Multi-Level Fusion Temporal-Spatial Co-Attention for Video-Based Person Re-Identification

BiCnet-TKS: Learning Efficient Spatial-Temporal Representation for Video Person Re-Identification

STFE: A Comprehensive Video-Based Person Re-Identification Network Based on Spatio-Temporal Feature Enhancement

Dual Attention Matching Network for Context-Aware Feature Sequence based Person Re-Identification

Spatial-Temporal Attention-aware Learning for Video-based Person Re-identification

Cross-Modality Spatial-Temporal Transformer for Video-Based Visible-Infrared Person Re-Identification

Temporal Complementarity-Guided Reinforcement Learning for Image-to-Video Person Re-Identification

Multitask Multigranularity Aggregation With Global-Guided Attention for Video Person Re-Identification

Person Re-Identification by Unsupervised Video Matching

Spatial-Temporal Person Re-Identification