Adaptive and Collaborative Multi-scale Alignment for Text-Based Person Search

Xinxin Yang, Renjie Pan, Hua Yang
DOI: https://doi.org/10.1109/vcip59821.2023.10402622
2023-01-01
Abstract: Text-to-image person search is challenging due to cross-scale correspondences and information inequality between modalities. Specifically, images and text are linked in complex ways at different scales, and images are usually more informative and complete than text. It is therefore crucial to establish semantic correlations between modalities and to focus on task-relevant information in images. In this paper, we propose a novel Adaptive and Collaborative Multi-scale Alignment network (ACMA) for text-based person search that learns semantically consistent and information-aligned multi-modal representations. First, we introduce a novel joint embedding module that adaptively integrates features of different pixels and words, thereby extracting semantically consistent multi-modal features at different scales. Second, we design an auxiliary visual branch based on cross-modal fusion features to guide the extraction of key visual features that are beneficial for cross-modal matching. Extensive experiments validate that ACMA outperforms state-of-the-art methods.