Abstract: Text attribute person search aims to retrieve specific pedestrians given a set of textual attributes, which is valuable in scenarios such as locating a designated pedestrian from a witness description. The key challenge is the significant modality gap between textual attributes and images. Previous methods focus on explicit representation and alignment using unimodal pre-trained models. However, because these models lack inter-modality correspondence, local intra-modality information may be distorted. Moreover, these methods consider only inter-modality alignment and ignore the differences between attribute categories. To mitigate these problems, we propose an Attribute-Aware Implicit Modality Alignment (AIMA) framework that learns the correspondence of local representations between textual attributes and images and combines it with global representation matching to narrow the modality gap. First, we adopt the CLIP model as the backbone and design prompt templates that transform attribute combinations into structured sentences, which helps the model understand and match image details. Next, we design a Masked Attribute Prediction (MAP) module that predicts masked attributes after multi-modal interaction between image features and masked textual attribute features, thereby achieving implicit local relationship alignment. Finally, we propose an Attribute-IoU Guided Intra-Modal Contrastive (A-IoU IMC) loss that aligns the distribution of different textual attributes in the embedding space with their IoU distribution, yielding a better semantic arrangement. Extensive experiments on the Market-1501 Attribute, PETA, and PA100K datasets show that our method significantly surpasses current state-of-the-art methods.
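
To make two of the ideas above concrete, the following is a minimal sketch of (a) turning an attribute combination into a structured sentence and (b) computing a pairwise attribute IoU between queries. It is illustrative only: the template wording, the function names `attributes_to_sentence`, `attribute_iou`, and `iou_matrix`, and the choice of set-based (Jaccard) IoU over `(attribute, value)` pairs are assumptions, since the abstract does not specify the actual prompt templates or the exact IoU definition.

```python
from typing import Dict, List

# Hypothetical template: the real prompt wording is not given in the abstract.
def attributes_to_sentence(attrs: Dict[str, str]) -> str:
    """Turn an attribute combination into one structured sentence."""
    # Sort by attribute name so the same combination always yields the same text.
    parts = [f"{name} is {value}" for name, value in sorted(attrs.items())]
    return "A photo of a person whose " + ", ".join(parts) + "."

# Assumption: IoU is taken over sets of (attribute, value) pairs (Jaccard index).
def attribute_iou(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Intersection-over-union of two attribute combinations."""
    set_a, set_b = set(a.items()), set(b.items())
    union = set_a | set_b
    if not union:
        return 1.0  # two empty combinations are treated as identical
    return len(set_a & set_b) / len(union)

def iou_matrix(queries: List[Dict[str, str]]) -> List[List[float]]:
    """Pairwise IoU between all attribute combinations in a batch."""
    return [[attribute_iou(q, r) for r in queries] for q in queries]

if __name__ == "__main__":
    q1 = {"gender": "female", "upper color": "red", "backpack": "yes"}
    q2 = {"gender": "female", "upper color": "blue", "backpack": "yes"}
    print(attributes_to_sentence(q1))
    print(iou_matrix([q1, q2]))
```

A matrix of this kind could serve as a soft target for an intra-modal contrastive objective, so that textual embeddings of heavily overlapping attribute sets sit closer together than those of disjoint sets. How AIMA actually uses the IoU distribution in its loss is not stated in the abstract.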

Cross-modal Co-occurrence Attributes Alignments for Person Search by Language

Improving Inconspicuous Attributes Modeling for Person Search by Language

Beyond the Parts: Learning Coarse-to-Fine Adaptive Alignment Representation for Person Search

Attribute-Aware Implicit Modality Alignment for Text Attribute Person Search

Comprehensive Attribute Prediction Learning for Person Search by Language

Text-based person search via cross-modal alignment learning

Hybrid Attention Network for Language-Based Person Search

Person Search by Multi-Scale Matching

Learning Semantic-Aligned Feature Representation for Text-based Person Search

Adaptive and Collaborative Multi-scale Alignment for Text-Based Person Search

Asymmetric Cross-Scale Alignment for Text-Based Person Search

Attentive Feature Focusing for Person Search by Natural Language

Fusing Two Directions in Cross-Domain Adaption for Real Life Person Search by Language

Improving Cross-Modal Constraints: Text Attribute Person Search with Graph Attention Networks

See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval

Joint Token and Feature Alignment Framework for Text-Based Person Search

Cross-Modal Knowledge Adaptation for Language-Based Person Search

Cascade Attention Network for Person Search: Both Image and Text-Image Similarity Selection

Cascaded Cross-modal Alignment for Visible-Infrared Person Re-Identification

Cross-modal Generation and Alignment Via Attribute-guided Prompt for Unsupervised Text-based Person Retrieval

Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark