Abstract: Video-based person re-identification (Re-ID) aims to retrieve a given person from video sequences captured by non-overlapping cameras. Some characteristics of pedestrians are not consecutive across frames because viewpoints, postures, and occlusions vary over time. However, existing methods ignore this peculiarity of the data, and their networks tend to learn only the salient characteristics that persist across the frames of a video sequence. As a result, the learned representations fail to cover all the characteristics of pedestrians and thus lack integrity and discrimination. To tackle this problem, we present a novel deep architecture termed Hierarchical Mining Network (HMN), which mines as many pedestrian characteristics as possible by referring to temporal and intra-class knowledge. It consists of a novel Attentive Temporal Module (ATM) and a Dynamic Supervising Branch (DSB), with a Balancing Triplet Loss (BTL) assisting the training. The proposed ATM, with its pedestrian-perceiving capacity, evaluates each feature activation through temporal analysis, so that the temporally scattered characteristics of pedestrians are better aggregated and the contaminated ones are eliminated. The DSB, together with the BTL, then further enhances the integrity of the representations through multiple supervision. Specifically, the DSB perceives the diversity of intra-class samples in each mini-batch and generates targeted supervising signals for them, while the BTL ensures that these signals have smaller intra-class variations and larger inter-class variations. Comprehensive experiments on two video-based datasets, i.e., MARS and DukeMTMC-VideoReID, demonstrate the contribution of each component and the superiority of the proposed HMN over state-of-the-art methods. Benchmarking our model on three popular image-based datasets, i.e., Market1501, DukeMTMC-Reid, and MSMT17, additionally verifies the promising generalizability of the proposed DSB and BTL.
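The abstract does not specify how the ATM scores activations, so the following is only a minimal sketch of the general idea of weighting each feature activation across time before aggregation. It assumes a per-channel softmax over the frame axis, which is an illustrative choice and not the authors' module; the function name `temporal_attention_pool` is hypothetical.

```python
import math


def temporal_attention_pool(frames):
    """Aggregate per-frame feature vectors into one clip-level vector.

    frames: list of T lists, each holding C activations for one frame.
    Assumption (not from the paper): for every channel, a softmax over
    the T frames turns the activations into weights, so frames in which
    a characteristic is strongly present dominate that channel's result.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_channels = len(frames[0])
    pooled = []
    for c in range(num_channels):
        column = [frame[c] for frame in frames]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in column]
        total = sum(exps)
        weights = [e / total for e in exps]
        pooled.append(sum(w * v for w, v in zip(weights, column)))
    return pooled
```

With this weighting, a characteristic that appears in only one frame of a clip contributes far more to the clip-level feature than it would under plain average pooling, which is the behaviour the abstract attributes to aggregating temporally scattered characteristics.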

Robust Video-Based Person Re-Identification by Hierarchical Mining
