Abstract: Image-to-video person re-identification aims to retrieve, from a video-based gallery set, the same pedestrian as the one shown in an image-based query. Existing methods treat it as a cross-modality retrieval task and learn common latent embeddings from the image and video modalities, which is both less effective and less efficient because of the large modality gap and the redundant feature learning that results from using all video frames. In this work, we first regard this task as a point-to-set matching problem, analogous to the human decision process, and propose a novel Temporal Complementarity-Guided Reinforcement Learning (TCRL) approach for image-to-video person re-identification. TCRL employs deep reinforcement learning to make sequential judgments, dynamically selecting a suitable number of frames from gallery videos and accumulating adequate temporally complementary information among these frames under the guidance of the query image, so as to balance efficiency and accuracy. Specifically, TCRL formulates the point-to-set matching procedure as a Markov decision process, in which a sequential judgment agent measures the uncertainty between the query image and all historical frames at each time step, and verifies whether sufficient complementary clues have been accumulated for a judgment (same or different) or whether one more frame should be requested to assist the judgment. Moreover, TCRL maintains a sequential feature extraction module with complementary residual detectors to dynamically suppress redundant salient regions and thoroughly mine diverse complementary clues among the selected frames, enhancing the frame-level representation. Extensive experiments demonstrate the superiority of our method.
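
The sequential judgment loop described in the abstract can be illustrated with a toy sketch. This is not the TCRL implementation: the learned reinforcement-learning agent is replaced here by two fixed, hypothetical thresholds (`same_thr`, `diff_thr`), the uncertainty measure is assumed to be cosine similarity between the query feature and the sum of the frame features seen so far, and the function name `sequential_judgment` is invented for illustration. The sketch only shows the control flow of "decide now or request one more frame".

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sequential_judgment(
    query: Sequence[float],
    frames: List[Sequence[float]],
    same_thr: float = 0.8,   # hypothetical threshold, stands in for a learned policy
    diff_thr: float = 0.2,   # hypothetical threshold, stands in for a learned policy
) -> Tuple[str, int]:
    """Return (decision, number_of_frames_used) for one query/gallery-video pair."""
    if not frames:
        raise ValueError("gallery video must contain at least one frame")

    accumulated = [0.0] * len(query)
    score = 0.0
    for t, frame in enumerate(frames, start=1):
        # Accumulate evidence from every frame seen so far (the "history").
        accumulated = [s + f for s, f in zip(accumulated, frame)]
        score = cosine(query, accumulated)

        # Terminal actions: confident enough to judge "same" or "different".
        if score >= same_thr:
            return "same", t
        if score <= diff_thr:
            return "different", t
        # Otherwise the implicit action is "request one more frame".

    # Frames exhausted: force a decision at the midpoint of the two thresholds.
    midpoint = 0.5 * (same_thr + diff_thr)
    return ("same" if score >= midpoint else "different"), len(frames)
```

In this sketch an ambiguous first frame leads to a second frame being requested, while a clearly dissimilar first frame ends the episode immediately, which is the efficiency and accuracy trade-off the abstract refers to.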

Revisiting Temporal Modeling for Video-based Person ReID

Joining Features by Global Guidance with Bi-Relevance Trihard Loss for Person Re-Identification

Temporal Complementarity-Guided Reinforcement Learning for Image-to-Video Person Re-Identification

Video-Based Person Re-identification with Improved Temporal Attention and Spatial Memory

AA-RGTCN: Reciprocal Global Temporal Convolution Network with Adaptive Alignment for Video-Based Person Re-Identification

Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos

See The Forest For The Trees: Joint Spatial And Temporal Recurrent Neural Networks For Video-Based Person Re-Identification

MSTN: A Multi-granular Spatial–Temporal Network for video-based person re-identification

Video-based Person Re-Identification Via Spatio-Temporal Attentional and Two-Stream Fusion Convolutional Networks

Hierarchical Temporal Modeling With Mutual Distance Matching for Video Based Person Re-Identification

Video Person Re-Identification by Temporal Residual Learning

Spatial-Temporal Graph Convolutional Network for Video-Based Person Re-Identification

Video-based Person Re-identification via 3D Convolutional Networks and Non-local Attention

Temporal Attention Quality Aware Network for Video-based Person Re-Identification

Superpixel-Based Temporally Aligned Representation For Video-Based Person Re-Identification

Learning Recurrent 3D Attention for Video-Based Person Re-Identification

Spatial and Temporal Mutual Promotion for Video-Based Person Re-Identification

An Unbiased Temporal Representation for Video-Based Person Re-Identification

Spatial-Temporal Attention-aware Learning for Video-based Person Re-identification

Learning Modal-Invariant and Temporal-Memory for Video-based Visible-Infrared Person Re-Identification

Spatial-Temporal Synergic Residual Learning for Video Person Re-Identification