Abstract: Text-Based Person Search (TBPS) is a crucial task that retrieves target individuals from large-scale galleries given only a textual caption. For cross-modal TBPS, it is critical to obtain well-distributed representations in the common embedding space to reduce the inter-modal gap. Furthermore, learning detailed image-text correspondences is essential to discriminate between similar targets and enable fine-grained search. To address these challenges, we present a simple yet effective method named Sew Calibration and Masked Modeling (SCMM) that calibrates cross-modal representations by learning compact and well-aligned embeddings. SCMM is distinguished by two novel losses that provide fine-grained cross-modal representations: 1) a Sew calibration loss that takes the quality of textual captions as guidance and aligns features between the image and text modalities, and 2) a Masked Caption Modeling (MCM) loss that leverages a masked caption prediction task to establish detailed and generic relationships between textual and visual parts. This dual-pronged strategy refines feature alignment and enriches cross-modal correspondences, enabling the accurate distinction of similar individuals. Moreover, the streamlined dual-encoder architecture avoids complex branches and interactions and enables high-speed inference, which is essential for resource-limited applications that often demand real-time processing. Extensive experiments on three popular TBPS benchmarks demonstrate the superiority of SCMM, which achieves top results with 73.81%, 74.25%, and 57.35% Rank-1 accuracy on CUHK-PEDES, ICFG-PEDES, and RSTPReID, respectively. We hope SCMM's scalable and cost-effective design will serve as a strong baseline and facilitate future research in this field.
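To make the dual-encoder retrieval setting and the Rank-1 metric concrete, the following is a minimal sketch, not the SCMM implementation. It assumes that caption and image embeddings have already been produced by two separate encoders, ranks gallery images by cosine similarity, and counts a query as a hit when its top-ranked image shares its identity. The function names (`l2_normalize`, `cosine_similarity_matrix`, `rank1_accuracy`) and the toy two-dimensional embeddings are hypothetical and chosen only for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity_matrix(text_embs, image_embs):
    """Return sim[i][j] = cosine similarity of caption i and gallery image j."""
    texts = [l2_normalize(t) for t in text_embs]
    images = [l2_normalize(v) for v in image_embs]
    return [[sum(a * b for a, b in zip(t, v)) for v in images] for t in texts]


def rank1_accuracy(sim, text_ids, image_ids):
    """Fraction of captions whose most similar image has the same identity."""
    hits = 0
    for row, query_id in zip(sim, text_ids):
        best = max(range(len(row)), key=row.__getitem__)
        hits += image_ids[best] == query_id
    return hits / len(text_ids)


# Toy data (assumed): three captions and three gallery images, one per identity.
text_embs = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.5]]
text_ids = [0, 1, 2]
image_embs = [[2.0, 0.0], [0.1, 3.0], [1.0, 1.0]]
image_ids = [0, 1, 2]

sim = cosine_similarity_matrix(text_embs, image_embs)
score = rank1_accuracy(sim, text_ids, image_ids)
print(f"Rank-1 accuracy: {score:.2%}")
```

Because the two encoders never interact, gallery embeddings can be computed once and reused for every query, which is the usual reason dual-encoder designs are fast at inference time.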

SCMM: Calibrating Cross-modal Representations for Text-Based Person Search