Learning Recurrent 3D Attention for Video-Based Person Re-Identification

Guangyi Chen, Jiwen Lu, Ming Yang, Jie Zhou
DOI: https://doi.org/10.1109/TIP.2020.2995272
IF: 10.6
2020-01-01
IEEE Transactions on Image Processing
Abstract: In this paper, we propose to learn recurrent 3D attention (A3D) for video-based person re-identification. Attention models play a key role in both the spatial and temporal domains for video representation. Most existing methods apply a spatial attention model to extract features from a single image and aggregate the image features with attentive temporal pooling or an RNN. However, the inherent consistencies and correlations between spatial and temporal clues are not leveraged. Our A3D method aims to utilize the joint constraints of temporal and spatial attention to enhance the robustness of the attention model. Towards this goal, we treat the pedestrian video as a unified 3D bin in which the temporal domain is denoted as an additional dimension. We then develop an attention agent to iteratively select the locations of the salient spatial-temporal parts in the 3D bin. In addition, we formulate our sequential 3D attention learning as a Markov Decision Process and train the representation network and attention detector with the policy gradient method in an end-to-end manner. We evaluate the proposed method on three challenging datasets, iLIDS-VID, PRID-2011 and the large-scale MARS dataset, and consistently improve performance in comparison with state-of-the-art methods.
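The policy-gradient idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a stateless softmax policy over the cells of a T x H x W bin in place of the recurrent attention agent, and a made-up reward (the sum of a given saliency score at the selected cells) in place of a re-identification objective. The names `softmax`, `unflatten`, and `reinforce_step` are hypothetical helpers introduced only for this illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def unflatten(index, shape):
    """Map a flat index to (t, y, x) in a T x H x W bin (row-major order)."""
    _, h, w = shape
    return index // (h * w), (index // w) % h, index % w


def reinforce_step(logits, saliency, shape, steps, lr, baseline, rng):
    """One REINFORCE update for a stateless softmax policy over 3D locations.

    `logits` and `saliency` are flat lists of length T * H * W.
    Returns (new_logits, chosen_locations, reward).
    """
    probs = softmax(logits)
    # Sample `steps` spatial-temporal locations from the current policy.
    actions = rng.choices(range(len(probs)), weights=probs, k=steps)
    # Toy reward (assumption): total saliency at the selected locations.
    reward = sum(saliency[a] for a in actions)
    # Gradient of sum_k log pi(a_k) w.r.t. the logits: one_hot(a_k) - probs.
    grad = [-steps * p for p in probs]
    for a in actions:
        grad[a] += 1.0
    # Scale the gradient by the advantage (reward minus baseline).
    advantage = reward - baseline
    new_logits = [v + lr * advantage * g for v, g in zip(logits, grad)]
    return new_logits, [unflatten(a, shape) for a in actions], reward


if __name__ == "__main__":
    shape = (2, 3, 3)  # T frames, H rows, W columns (toy sizes)
    n = shape[0] * shape[1] * shape[2]
    rng = random.Random(0)
    logits = [0.0] * n
    saliency = [1.0] * n
    logits, locations, reward = reinforce_step(
        logits, saliency, shape, steps=3, lr=0.1, baseline=0.0, rng=rng
    )
    print(locations, reward)
```

Treating time as a third index means a single sampled action picks a frame and a spatial position together, which is the joint spatial-temporal selection the abstract describes. A recurrent agent would additionally condition each selection on the previous ones, which this sketch omits.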