Abstract: Video-based person re-identification (ReID) aims to exploit relevant features from spatial and temporal knowledge. Widely used methods include part-based and attention-based approaches for suppressing irrelevant spatial–temporal features. However, it remains challenging to overcome inconsistencies across video frames caused by occlusion and imperfect detection. These mismatches make temporal processing ineffective and create an imbalance of crucial spatial information. To address these problems, we propose the Spatiotemporal Multi-Granularity Aggregation (ST-MGA) method, which is specifically designed to accumulate relevant features with spatiotemporally consistent cues. The proposed framework consists of three main stages: extraction, which extracts spatiotemporally consistent partial information; augmentation, which augments the partial information with different granularity levels; and aggregation, which effectively aggregates the augmented spatiotemporal information. We first introduce the consistent part-attention (CPA) module, which extracts spatiotemporally consistent and well-aligned attentive parts. Sub-parts derived from CPA provide temporally consistent semantic information, which resolves misalignment in videos caused by occlusion or inaccurate detection, and they maximize the efficiency of aggregation through uniform partial information. To enhance the diversity of spatial and temporal cues, we introduce the Multi-Attention Part Augmentation (MA-PA) block, which incorporates fine parts at various granular levels, and the Long-/Short-term Temporal Augmentation (LS-TA) block, designed to capture both long- and short-term temporal relations. Using densely separated part cues, ST-MGA fully exploits and aggregates the spatiotemporal multi-granular patterns by comparing relations between parts and scales.
In experiments, the proposed ST-MGA achieves state-of-the-art performance on several video-based ReID benchmarks (i.e., MARS, DukeMTMC-VideoReID, and LS-VID).
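
To make the idea of augmenting part features at several spatial and temporal granularities more concrete, the following is a minimal sketch and not the ST-MGA implementation. It assumes plain averaging where the abstract describes attention-based blocks (MA-PA and LS-TA), and the function names, group sizes, and window length are hypothetical choices made only for illustration.

```python
from statistics import fmean


def mean_vectors(vectors):
    """Element-wise mean of equally sized feature vectors."""
    return [fmean(column) for column in zip(*vectors)]


def part_granularities(parts, group_sizes=(1, 2, 4)):
    """Merge adjacent part features of one frame into coarser parts.

    parts: list of P feature vectors (P must be divisible by each group size).
    Returns a dict mapping group size -> list of merged part vectors.
    Assumption: averaging stands in for the attention-based augmentation.
    """
    levels = {}
    for size in group_sizes:
        if len(parts) % size != 0:
            raise ValueError(f"{len(parts)} parts cannot be grouped by {size}")
        levels[size] = [
            mean_vectors(parts[i:i + size]) for i in range(0, len(parts), size)
        ]
    return levels


def temporal_granularities(frames, short_window=2):
    """Summarize one part's features over time at two scales.

    frames: list of T feature vectors for the same part across frames.
    Short-term: mean over non-overlapping windows of `short_window` frames.
    Long-term: mean over the whole sequence.
    """
    short_term = [
        mean_vectors(frames[i:i + short_window])
        for i in range(0, len(frames), short_window)
    ]
    long_term = mean_vectors(frames)
    return short_term, long_term


# Toy input: 4 parts with 2-dimensional features for a single frame.
parts = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
levels = part_granularities(parts)

# Toy input: one part tracked over 4 frames with 1-dimensional features.
frames = [[0.0], [2.0], [4.0], [6.0]]
short_term, long_term = temporal_granularities(frames)
```

In this sketch, `levels[1]` keeps the four fine parts, `levels[2]` holds two coarser parts, and `levels[4]` holds a single global part; an aggregation stage would then compare and combine these representations across parts and scales.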

Multitask Multigranularity Aggregation With Global-Guided Attention for Video Person Re-Identification

Multi-Granularity Aggregation with Spatiotemporal Consistency for Video-Based Person Re-Identification

Joining Features by Global Guidance with Bi-Relevance Trihard Loss for Person Re-Identification

Contribution-Based Multi-Stream Feature Distance Fusion Method with $k$-Distribution Re-Ranking for Person Re-Identification

Gaussian-based Probability Fusion for Person Re-Identification with Taylor Angular Margin Loss

MSTN: A Multi-granular Spatial–Temporal Network for video-based person re-identification

Multi-scale spatio-temporal feature adaptive aggregation for video-based Person Re-identification

Where to Look: Multi-Granularity Occlusion Aware for Video Person Re-Identification

Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification

Concentrated Multi-Grained Multi-Attention Network for Video Based Person Re-Identification

Multi-granular inter-frame relation exploration and global residual embedding for video-based person re-identification

AA-RGTCN: Reciprocal Global Temporal Convolution Network with Adaptive Alignment for Video-Based Person Re-Identification

Saliency and Granularity: Discovering Temporal Coherence for Video-Based Person Re-Identification

Relation-Guided Spatial Attention and Temporal Refinement for Video-Based Person Re-Identification

Improving Description-based Person Re-identification by Multi-granularity Image-text Alignments

A Multi-Scale Spatial-Temporal Attention Model for Person Re-Identification in Videos

Hierarchical Temporal Modeling With Mutual Distance Matching for Video Based Person Re-Identification

Spatial and Temporal Mutual Promotion for Video-Based Person Re-Identification

Enhancing identification for person search with multi-scale multi-grained representation learning

Dual Attention Matching Network for Context-Aware Feature Sequence based Person Re-Identification