Abstract: Objective Text-to-image person re-identification is a sub-task of image-text retrieval that aims to retrieve the images of a target person matching a given text description. The main challenge of the task is the significant feature gap between vision and language, which also restricts fine-grained matching between the semantic information of the two modalities. Recent methods often combine multiple local features with a global feature for cross-modal matching, but these local-level matching methods are complicated and slow down retrieval. Insufficient training data remains a further challenge. To alleviate it, conventional methods typically initialize their backbone models with weights pre-trained on large-scale single-modal datasets. However, this initialization cannot provide knowledge of fine-grained image-text cross-modal matching and semantic alignment. Therefore, an easy-to-use method is required to optimize cross-modal alignment in text-to-image person re-identification models.
Method We develop a transformer network with a temperature-scaled projection matching method and contrastive language-image pre-training (CLIP) for text-to-image person re-identification. CLIP is a general multimodal foundation model pre-trained on large-scale image-text datasets. A vision transformer is used as the visual backbone to preserve fine-grained information, which avoids the limitations of convolutional neural networks (CNNs) in modeling long-range relationships and the information loss caused by downsampling. To make full use of the cross-modal image-text alignment capability of the pre-trained CLIP model, our model performs fine-grained image-text semantic feature alignment using global features only.
In addition, a temperature-scaled cross-modal projection matching (TCMPM) loss function is developed for image-text cross-modal feature matching. The TCMPM loss minimizes the Kullback-Leibler (KL) divergence between the temperature-scaled projection distributions and the normalized true matching distributions within a mini-batch.
Result Extensive experiments are carried out in comparison with the latest text-to-image person re-identification methods on two popular public datasets, CUHK person description (CUHK-PEDES) and identity-centric and fine-grained person description (ICFG-PEDES), to validate the effectiveness of the proposed method. Rank-K (K = 1, 5, 10) is adopted as the retrieval evaluation metric. On the CUHK-PEDES dataset, the Rank-1 value is improved by 5.92% over the best-performing existing local-level matching method and by 7.09% over the existing global-level matching method. On the ICFG-PEDES dataset, the Rank-1 value is improved by 1.21% over the local-level matching model. Ablation studies are also carried out on both datasets. Compared with the original cross-modal projection matching (CMPM) loss, the TCMPM loss improves the Rank-1 value by 9.54% on CUHK-PEDES and by 4.67% on ICFG-PEDES. Compared with the InfoNCE loss, a commonly used loss in cross-modal contrastive learning, the TCMPM loss improves the Rank-1 value by 3.38% on CUHK-PEDES and by 0.42% on ICFG-PEDES.
Conclusion An end-to-end dual transformer network is developed to learn representations of person images and descriptive texts for text-to-image person re-identification. We demonstrate that a global-level matching method has the potential to outperform current state-of-the-art local-level matching methods.
The transformer network addresses two weaknesses of CNNs: their inability to model long-range relationships and the loss of detailed information caused by downsampling. In addition, the proposed method benefits from the powerful cross-modal alignment capability of CLIP, and together with the TCMPM loss, the model learns more discriminative image-text features.
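
The abstract describes the TCMPM loss only in words, so the following NumPy sketch shows one plausible reading of it rather than the authors' implementation. It assumes that the "projection" is the cosine similarity between L2-normalized global features, that the divergence is taken as KL(p || q) in both retrieval directions as in the original CMPM loss, and that two samples match when they share a person ID. The function name `tcmpm_loss`, the default `temperature=0.02`, and the smoothing constant `eps` are hypothetical choices made for illustration.

```python
import numpy as np


def tcmpm_loss(image_feats, text_feats, person_ids, temperature=0.02, eps=1e-8):
    """Sketch of a temperature-scaled projection matching loss for one mini-batch."""
    # Assumption: both modalities are L2-normalised, so the projection of one
    # feature onto the other is their cosine similarity.
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    # Temperature-scaled projections; entry (i, j) pairs image i with text j.
    logits = img @ txt.T / temperature

    # Normalised true matching distribution: pairs sharing a person ID match,
    # and each row is normalised to sum to 1.
    match = (person_ids[:, None] == person_ids[None, :]).astype(np.float64)
    q = match / match.sum(axis=1, keepdims=True)

    def mean_kl(scores):
        # Row-wise log-softmax, shifted by the row maximum for stability.
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_p = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        p = np.exp(log_p)
        # KL(p || q) per row, averaged over the batch; eps avoids log(0).
        return float(np.mean(np.sum(p * (log_p - np.log(q + eps)), axis=1)))

    # Sum of the image-to-text and text-to-image directions. The match matrix
    # is symmetric, so the same q serves both.
    return mean_kl(logits) + mean_kl(logits.T)
```

Under these assumptions, a lower temperature sharpens the projection distribution, so mismatched pairs that receive high similarity are penalized more strongly. In a training setup the same computation would be written with a differentiable framework so that gradients reach both encoders.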
