Abstract: Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text. In recent years, TBPS has made remarkable progress and state-of-the-art methods achieve superior performance by learning local fine-grained correspondence between images and texts. However, most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities, which is unreliable due to the lack of contextual information or the potential introduction of noise. Moreover, existing methods seldom consider the information inequality problem between modalities caused by image-specific information. To address these limitations, we propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels, and realize fast and effective person search. Specifically, we first design an image-specific information suppression module, which suppresses image background and environmental factors by relation-guided localization and channel attention filtration respectively. This module effectively alleviates the information inequality problem and realizes the alignment of information volume between images and texts. Secondly, we propose an implicit local alignment module to adaptively aggregate all pixel/word features of image/text to a set of modality-shared semantic topic centers and implicitly learn the local fine-grained correspondence between modalities without additional supervision and cross-modal interactions. And a global alignment is introduced as a supplement to the local perspective. The cooperation of global and local alignment modules enables better semantic alignment between modalities. Extensive experiments on multiple databases demonstrate the effectiveness and superiority of our MANet.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper aims to solve two main problems in the Text-based Person Search (TBPS) task:

1. **Information inequality**: Most existing methods encode images and texts directly to obtain feature representations, but overlook the information inequality caused by image-specific information (such as background and environmental factors). Specifically:
   - **Image background**: An image contains not only the pedestrian but also the surrounding background, which the text description does not mention, thus widening the gap between modalities.
   - **Environmental factors**: Different camera parameters and environmental conditions (such as lighting, weather, and viewing angle) cause significant intra-class differences among the captured images; these factors act as noise and further increase the gap between modalities.
2. **Fine-grained correspondence modeling**: Most existing methods rely on explicitly generated local parts to model the fine-grained correspondence between modalities. This approach is unreliable because the parts lack contextual information or may introduce noise. In addition, direct interaction between modalities lets them affect each other, which hinders the modeling of fine-grained correspondence.

To solve these problems, the authors propose a Multi-level Alignment Network (MANet), which contains the following modules:

- **Image-Specific Information Suppression (ISS) module**: Through Relation-Guided Localization (RGL) and Channel Attention Filtration (CAF), it suppresses the image background and environmental factors respectively, aligning the amount of information between modalities.
- **Implicit Local Alignment (ILA) module**: By introducing a set of learnable modality-shared semantic topic centers, it implicitly learns locally aligned image and text features, avoiding both the unreliability of explicitly generated local parts and the influence of direct interaction between modalities.
- **Global Alignment (GA) module**: By spatially aggregating the salient information in the image and temporally aggregating the salient information in the text, it maximizes their similarity in the joint embedding space to achieve global alignment.

The combined use of these modules enables MANet to perform the text-based person search task more effectively.
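
The idea behind implicit local alignment, pooling every pixel or word feature onto one shared set of topic centers, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the scaled dot-product attention, the cosine-based score, the random stand-in features, and all names (`aggregate_to_topics`, `local_similarity`, `topic_centers`) are assumptions chosen for illustration. The point it shows is that each modality is aggregated independently against the same centers, so no cross-modal interaction is needed to obtain comparable local features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def aggregate_to_topics(features, topic_centers):
    """Pool (num_tokens, dim) features into (num_topics, dim) local features.

    Assumption: scaled dot-product attention with the topic centers as
    queries; the paper's exact aggregation rule may differ.
    """
    dim = features.shape[1]
    scores = topic_centers @ features.T / np.sqrt(dim)  # (num_topics, num_tokens)
    weights = softmax(scores, axis=1)  # each topic's weights sum to 1
    return weights @ features  # (num_topics, dim)


def local_similarity(image_topics, text_topics):
    """Mean cosine similarity between topic features with the same index."""
    img = image_topics / np.linalg.norm(image_topics, axis=1, keepdims=True)
    txt = text_topics / np.linalg.norm(text_topics, axis=1, keepdims=True)
    return float(np.mean(np.sum(img * txt, axis=1)))


# Hypothetical sizes and random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
dim, num_topics = 16, 4
topic_centers = rng.normal(size=(num_topics, dim))  # learnable in a real model
pixel_features = rng.normal(size=(24, dim))  # e.g. a flattened feature map
word_features = rng.normal(size=(9, dim))  # e.g. per-word encoder outputs

# Each modality is pooled separately against the SAME topic centers.
image_topics = aggregate_to_topics(pixel_features, topic_centers)
text_topics = aggregate_to_topics(word_features, topic_centers)
score = local_similarity(image_topics, text_topics)
```

Because the two modalities never attend to each other, gallery image features can be computed once and reused for every query text, which is consistent with the fast search the abstract emphasizes.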

Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search
