Abstract: Text attribute person search aims to find specific pedestrians from given textual attributes, which is valuable when searching for a designated pedestrian based on witness descriptions. The key challenge is the significant modality gap between textual attributes and images. Previous methods focused on explicit representation and alignment using unimodal pre-trained models. However, because these models lack inter-modality correspondence, local information within each modality may be distorted. Moreover, these methods considered only inter-modality alignment and ignored the differences between attribute categories. To mitigate these problems, we propose an Attribute-Aware Implicit Modality Alignment (AIMA) framework that learns the correspondence of local representations between textual attributes and images and combines it with global representation matching to narrow the modality gap. First, we adopt the CLIP model as the backbone and design prompt templates that transform attribute combinations into structured sentences, which helps the model understand and match image details. Next, we design a Masked Attribute Prediction (MAP) module that predicts masked attributes after image features and masked textual attribute features pass through multi-modal interaction, thereby achieving implicit local relationship alignment. Finally, we propose an Attribute-IoU Guided Intra-Modal Contrastive (A-IoU IMC) loss that aligns the distribution of different textual attributes in the embedding space with their IoU distribution, yielding a better semantic arrangement. Extensive experiments on the Market-1501 Attribute, PETA, and PA100K datasets show that the proposed method significantly surpasses current state-of-the-art methods.
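
To make two of the ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical prompt template wording and assumes that the attribute IoU between two queries can be computed as the set intersection-over-union of their attribute labels; the function names `attributes_to_sentence`, `attribute_iou`, and `pairwise_iou` are illustrative only and do not come from the paper.

```python
from typing import Dict, List, Set


def attributes_to_sentence(attrs: Dict[str, str]) -> str:
    """Turn an attribute combination into one structured sentence.

    The template wording is an assumption made for illustration;
    the abstract does not give the actual prompt templates.
    """
    parts = [f"{name} is {value}" for name, value in attrs.items()]
    return "A photo of a pedestrian whose " + ", ".join(parts) + "."


def attribute_iou(a: Set[str], b: Set[str]) -> float:
    """Set-based intersection-over-union of two attribute label sets (assumed definition)."""
    union = a | b
    if not union:
        # Two empty sets are identical, so treat them as fully overlapping.
        return 1.0
    return len(a & b) / len(union)


def pairwise_iou(attr_sets: List[Set[str]]) -> List[List[float]]:
    """IoU matrix over a batch of attribute sets.

    A matrix like this could serve as the target similarity structure
    that an IoU-guided intra-modal contrastive loss compares against.
    """
    return [[attribute_iou(a, b) for b in attr_sets] for a in attr_sets]


if __name__ == "__main__":
    query = {"gender": "female", "upper color": "red"}
    print(attributes_to_sentence(query))
    batch = [{"female", "red top"}, {"female", "blue top"}, {"male", "blue top"}]
    for row in pairwise_iou(batch):
        print(row)
```

Under these assumptions, queries that share more attributes receive a higher IoU, which is the kind of graded similarity the A-IoU IMC loss is described as imposing on the text embedding space.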

Attribute-Aware Implicit Modality Alignment for Text Attribute Person Search