Semantic Parsing and Attentive Feature-Temporal Pooling Network for Video-Based Person Image Retrieval

Yu Mao, Haiqing Du, Yong Liu
DOI: https://doi.org/10.1109/icdew.2019.00-10
2019-01-01
Abstract: Video person re-identification is a crucial task due to its applications in visual surveillance and human-computer interaction. The purpose of such algorithms is to retrieve the matching pedestrian images from a large number of cross-device surveillance videos, given a pedestrian image as a probe. In recent years, a growing number of researchers have come to regard this problem as a special type of image retrieval. Existing works mainly focus on extracting representative features from the whole image and integrating those features across a sequence through temporal modeling. However, these approaches rarely harness local visual cues to strengthen image-level feature learning. In this paper, we propose a novel neural network that incorporates human semantic parsing to improve image-level representations. Specifically, the human semantic parsing network segments a human image into multiple parts with fine-grained semantics, and the subsequent attentive feature pooling layer selects the most significant body parts to strengthen the feature representations. Carefully designed experiments on two public datasets show the effectiveness of each component of the proposed deep network, improving state-of-the-art video person sequence retrieval on iLIDS-VID [1] by ~13% and on PRID-2011 by ~7% in rank-1 accuracy.
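The abstract does not specify the layer's internals, so the following is only a minimal sketch of the general idea: pool features inside each parsed body-part mask, weight the parts by an attention score, then average over frames. The dot-product scoring against a `query` vector, the plain temporal mean, and all function names are assumptions for illustration, not the authors' architecture.

```python
import math

def masked_average_pool(feature_map, mask):
    """Average the C-dim vectors of an H x W x C map where mask == 1."""
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for row_feat, row_mask in zip(feature_map, mask):
        for vec, m in zip(row_feat, row_mask):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return total  # part not visible in this frame: zero vector
    return [t / count for t in total]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def attentive_part_pooling(feature_map, part_masks, query):
    """Assumed scoring: dot product of each part feature with `query`."""
    parts = [masked_average_pool(feature_map, m) for m in part_masks]
    scores = [sum(q * p for q, p in zip(query, part)) for part in parts]
    weights = softmax(scores)
    # Frame feature is the attention-weighted sum of the part features.
    frame = [sum(w * part[c] for w, part in zip(weights, parts))
             for c in range(len(query))]
    return frame, weights

def temporal_average(frame_features):
    """Assumed temporal pooling: plain mean over the frame features."""
    n = len(frame_features)
    return [sum(col) / n for col in zip(*frame_features)]

# Toy frame: 1 x 2 pixels, 2 channels, two body-part masks.
feature_map = [[[1.0, 0.0], [0.0, 1.0]]]
part_masks = [[[1, 0]], [[0, 1]]]
frame, weights = attentive_part_pooling(feature_map, part_masks, [1.0, 0.0])
clip = temporal_average([frame, frame])
```

In this toy case the first part aligns with the query, so it receives the larger attention weight; in a trained network the scoring parameters would be learned rather than fixed.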