Joint Token and Feature Alignment Framework for Text-Based Person Search.

Shangze Li, Andong Lu, Yan Huang, Chenglong Li, Liang Wang
DOI: https://doi.org/10.1109/lsp.2022.3217682
2022-01-01
IEEE Signal Processing Letters
Abstract: Text-based person search is a challenging cross-modal retrieval task. Existing works reduce the inter-modality and intra-class gaps by aligning local features extracted from the image and text modalities, which easily leads to mismatching problems due to the lack of annotation information. Moreover, it is sub-optimal to reduce the two gaps simultaneously in the same feature space. This work proposes a novel joint token and feature alignment framework that reduces the inter-modality and intra-class gaps progressively. Specifically, we first build a dual-path feature learning network to extract features and conduct feature alignment to reduce the inter-modality gap. Second, we design a text generation module that generates token sequences from visual features, and then perform token alignment to reduce the intra-class gap. Last, a fusion interaction module is introduced to further eliminate the modality heterogeneity using a multi-stage feature fusion strategy. Extensive experiments on the CUHK-PEDES dataset demonstrate the effectiveness of our model, which significantly outperforms previous state-of-the-art methods.
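To make the feature alignment step concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that image and text features have already been extracted by two separate encoders (the "dual-path" network) and that alignment is driven by a symmetric contrastive objective over cosine similarities. The abstract does not specify the loss, so the objective, the temperature value, and all function names here are hypothetical.

```python
import math

def l2_normalize(vec, eps=1e-12):
    # Scale a feature vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(v * v for v in vec)) + eps
    return [v / norm for v in vec]

def cosine_matrix(image_feats, text_feats):
    # Entry [i][j] is the cosine similarity between image i and text j.
    imgs = [l2_normalize(f) for f in image_feats]
    txts = [l2_normalize(f) for f in text_feats]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def _cross_entropy_diag(logits):
    # Mean cross-entropy where the correct class of row k is column k
    # (the k-th image and the k-th text are assumed to be a matched pair).
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[k]
    return total / len(logits)

def alignment_loss(image_feats, text_feats, temperature=0.07):
    # Assumed symmetric contrastive loss: pulls matched image/text pairs
    # together and pushes mismatched pairs apart, in both directions.
    sims = cosine_matrix(image_feats, text_feats)
    i2t = [[s / temperature for s in row] for row in sims]
    t2i = [list(col) for col in zip(*i2t)]
    return 0.5 * (_cross_entropy_diag(i2t) + _cross_entropy_diag(t2i))

def rank_images(text_feat, image_feats):
    # Retrieval: order gallery image indices by similarity to one text query.
    sims = cosine_matrix(image_feats, [text_feat])
    scores = [row[0] for row in sims]
    return sorted(range(len(scores)), key=lambda k: -scores[k])
```

In this sketch, a lower alignment_loss means that matched image and text features lie closer together in the shared space, which is the sense in which feature alignment reduces the inter-modality gap. At test time, rank_images orders a gallery of person images for a given text description. The token alignment and fusion interaction modules described in the abstract are not covered here.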