Part-Based Multi-Scale Attention Network for Text-Based Person Search.

Yubin Wang, Ding Qi, Cairong Zhao
DOI: https://doi.org/10.1007/978-3-031-18907-4_36
2022-01-01
Abstract: Text-based person search aims to retrieve a target person from an image gallery given a textual description. This fine-grained cross-modal retrieval problem is challenging because of the gap between modalities. Moreover, the inter-class variance of both person images and descriptions is small, so more semantic information is needed to align visual and textual representations at different scales. In this paper, we propose a Part-based Multi-Scale Attention Network (PMAN) that extracts visual semantic features at different scales and matches them with textual features. We first extract visual and textual features using ResNet and BERT, respectively. Multi-scale visual semantics are then derived from local feature maps of different scales. The proposed method learns representations for both modalities simultaneously, relying mainly on a Bottleneck Transformer with a self-attention mechanism. A multi-scale cross-modal matching strategy is introduced to narrow the gap between modalities at multiple scales. Extensive experimental results show that our method outperforms state-of-the-art methods on the CUHK-PEDES dataset.
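To make the idea of part-based, multi-scale matching concrete, the following is a minimal sketch and not the PMAN implementation. It assumes the visual feature map has already been reduced to one feature vector per horizontal row, that text features are already available as one vector per part at each scale, and that the final score is a plain average of per-part cosine similarities. The function names, the choice of scales, and the averaging rule are all hypothetical; the abstract does not specify them, and the ResNet, BERT, and Bottleneck Transformer stages are omitted entirely.

```python
import math


def mean_pool(rows):
    """Average a list of equal-length vectors into a single vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def split_into_parts(rows, num_parts):
    """Split rows into num_parts contiguous horizontal stripes of near-equal size."""
    h = len(rows)
    if h < num_parts:
        raise ValueError("need at least one row per part")
    bounds = [(i * h) // num_parts for i in range(num_parts + 1)]
    return [rows[bounds[i]:bounds[i + 1]] for i in range(num_parts)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def multi_scale_similarity(visual_rows, text_parts_by_scale):
    """Average part-level cosine similarity over several scales.

    visual_rows: list of row feature vectors (top to bottom of the image).
    text_parts_by_scale: dict mapping a scale k to a list of k text vectors
        (an assumed input format, chosen only for this illustration).
    """
    scale_scores = []
    for scale, text_parts in text_parts_by_scale.items():
        stripes = split_into_parts(visual_rows, scale)
        visual_parts = [mean_pool(stripe) for stripe in stripes]
        part_scores = [cosine(v, t) for v, t in zip(visual_parts, text_parts)]
        scale_scores.append(sum(part_scores) / len(part_scores))
    return sum(scale_scores) / len(scale_scores)
```

In this sketch, scale 1 compares a single globally pooled vector, while larger scales compare progressively finer horizontal stripes, so a description that matches the overall appearance but not the individual parts receives a lower combined score.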