Abstract: The central problem of text-based person retrieval is how to bridge the gap between heterogeneous cross-modal data. Many previous works learn a latent common space to bridge the modality gap and extract modality-invariant feature vectors. In these methods, common space mapping and cross-modal information matching are conducted in a one-off manner, aiming to extract sufficient discriminative clues from the high-dimensional multi-modal data at first glance. This is inconsistent with the fact that humans usually follow a step-by-step process to recognize and match two objects. Intuitively, the large heterogeneity gap between multi-modal data can be better bridged by gradually analyzing the complex cross-modal relationships. In this paper, we propose a Serialized Updating and Matching (SUM) method for text-based person retrieval that bridges the heterogeneity gap between cross-modal data in a step-by-step manner. The core component of SUM is the proposed Memory Gating Module (MGM), which can be stacked to gradually update and match features extracted from the visual and textual modalities. To fully exploit the correlations that lie within multi-granular cross-modal data, two variants are designed to handle both global and fine-grained local information, namely the Global Memory Gating Module (GMGM) and the Fine-grained Memory Gating Module (FMGM), in which the updating rate of information at each step is dynamically determined after observing the feature in the opposite modality. Moreover, SUM can be flexibly used as an add-on to any multi-granular text-based person retrieval method to further improve performance. We evaluate the proposed method on two text-based person retrieval datasets, CUHK-PEDES and RSTPReid, and on two general cross-modal retrieval datasets, Flickr8K and Flickr30K, to assess its generalization ability. Experimental results show that the proposed SUM outperforms existing methods and achieves state-of-the-art performance.
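
To make the step-by-step idea concrete, the following is a minimal sketch of a gated cross-modal update, not the authors' implementation. It assumes a GRU-style gate: at each step, the update rate for one modality's feature is computed from that feature concatenated with the opposite modality's feature, and a cosine similarity is recorded after every step. All function names, the gate form, the use of cosine similarity, and the randomly initialized weights (standing in for learned parameters) are assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_linear(out_dim, in_dim, rng):
    # Hypothetical stand-in for learned parameters: small random weights, zero bias.
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(weights, bias, x):
    # Plain matrix-vector product plus bias.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def gated_update(own, other, gate_params, cand_params):
    # The gate (update rate) is computed after observing the opposite modality.
    joint = own + other  # list concatenation of the two feature vectors
    gate = [sigmoid(z) for z in linear(*gate_params, joint)]
    candidate = [math.tanh(z) for z in linear(*cand_params, joint)]
    # Convex combination: keep (1 - g) of the old feature, take g of the candidate.
    updated = [(1.0 - g) * o + g * c for g, o, c in zip(gate, own, candidate)]
    return updated, gate


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def serialized_update_and_match(visual, textual, steps, seed=0):
    # Stack `steps` gated modules; each step updates both modalities from the
    # previous step's features, then records a matching score.
    rng = random.Random(seed)
    dim = len(visual)
    scores = []
    for _ in range(steps):
        v_gate, v_cand = init_linear(dim, 2 * dim, rng), init_linear(dim, 2 * dim, rng)
        t_gate, t_cand = init_linear(dim, 2 * dim, rng), init_linear(dim, 2 * dim, rng)
        new_visual, _ = gated_update(visual, textual, v_gate, v_cand)
        new_textual, _ = gated_update(textual, visual, t_gate, t_cand)
        visual, textual = new_visual, new_textual
        scores.append(cosine(visual, textual))
    return scores
```

Calling `serialized_update_and_match([0.2, -0.5, 0.9], [0.1, 0.4, -0.3], steps=3)` returns one similarity score per step. With untrained random weights the scores carry no meaning; the sketch only shows the data flow of stacking gated updates and matching after each one.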

SUM: Serialized Updating and Matching for text-based person retrieval