Abstract: Text-based person search (TBPS) is a challenging task that aims to retrieve pedestrian images of the same identity from an image gallery given a query text. In recent years, TBPS has made remarkable progress, and state-of-the-art methods achieve superior performance by learning local fine-grained correspondence between images and texts. However, most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities, which is unreliable because such parts lack contextual information or may introduce noise. Moreover, existing methods seldom consider the information inequality problem between modalities caused by image-specific information. To address these limitations, we propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which learns aligned image/text feature representations between modalities at multiple levels and realizes fast and effective person search. Specifically, we first design an image-specific information suppression module, which suppresses image background and environmental factors by relation-guided localization and channel attention filtration, respectively. This module effectively alleviates the information inequality problem and aligns the information volume between images and texts. Second, we propose an implicit local alignment module that adaptively aggregates all pixel/word features of an image/text to a set of modality-shared semantic topic centers and implicitly learns the local fine-grained correspondence between modalities without additional supervision or cross-modal interactions. A global alignment module is also introduced to complement the local perspective. The cooperation of the global and local alignment modules enables better semantic alignment between modalities. Extensive experiments on multiple databases demonstrate the effectiveness and superiority of our MANet.
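To make the implicit local alignment idea concrete, the following is a minimal NumPy sketch of aggregating pixel or word features to a shared set of topic centers. It is an illustration under stated assumptions, not the MANet implementation: the abstract does not specify the aggregation mechanism, so scaled dot-product softmax attention is assumed here, and all names (`aggregate_to_topics`, `local_similarity`, `centers`) and dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_to_topics(features, centers):
    """Pool n token features (n, d) into k topic features (k, d).

    Assumption: each shared topic center attends over all tokens with
    scaled dot-product attention; the paper may use a different scheme.
    """
    d = features.shape[1]
    scores = centers @ features.T / np.sqrt(d)   # (k, n)
    weights = softmax(scores, axis=1)            # each topic's weights sum to 1
    return weights @ features                    # (k, d)

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def local_similarity(img_feats, txt_feats, centers):
    # The same centers are used for both modalities, so topic i of the image
    # is compared with topic i of the text; no cross-modal attention is needed.
    img_topics = l2_normalize(aggregate_to_topics(img_feats, centers))
    txt_topics = l2_normalize(aggregate_to_topics(txt_feats, centers))
    return float((img_topics * txt_topics).sum(axis=1).mean())

# Hypothetical sizes: 4 topic centers, 16-dim features,
# 48 image positions, 12 words.
rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 16))
img_feats = rng.normal(size=(48, 16))
txt_feats = rng.normal(size=(12, 16))
sim = local_similarity(img_feats, txt_feats, centers)
```

Because each modality is pooled independently against the shared centers, gallery image features could in principle be computed once offline and compared with a query text by a simple per-topic cosine similarity, which is consistent with the abstract's claim of avoiding cross-modal interactions.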

Enhancing Visual Representation for Text-based Person Searching

Text-Guided Visual Feature Refinement for Text-Based Person Search

Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search

Fine-Granularity Alignment for Text-Based Person Retrieval Via Semantics-Centric Visual Division

Multi-granularity Matching Transformer for Text-based Person Search

TIPCB: A simple but effective part-based convolutional baseline for text-based person search

An Overview of Text-based Person Search: Recent Advances and Future Directions

MGRL: Mutual-Guidance Representation Learning for Text-to-Image Person Retrieval

See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval

Learning Semantic-Aligned Feature Representation for Text-based Person Search

Multi-path Exploration and Feedback Adjustment for Text-to-Image Person Retrieval

Beat: Bi-directional One-to-Many Embedding Alignment for Text-based Person Retrieval

Text-based Person Search in Full Images via Semantic-Driven Proposal Generation

VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search

SCMM: Calibrating Cross-modal Representations for Text-Based Person Search

Enhancing CLIP-Based Text-Person Retrieval by Leveraging Negative Samples

Text-Based Person Search with Limited Data

Local-enhanced Representation for Text-Based Person Search

Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search

Text-based person search via cross-modal alignment learning

Adversarial Attribute-Text Embedding for Person Search with Natural Language Query