Abstract:
Objective: Text-to-image person re-identification is a sub-task of image-text retrieval that aims to retrieve the images of a target person that match a given text description. The main challenge is the large feature gap between vision and language, which also restricts fine-grained matching between the semantic information of the two modalities. Recent methods often combine multiple local features with a global feature for cross-modal matching, but such local-level matching is complicated and slows retrieval. Insufficient training data is a further challenge. To alleviate it, conventional methods typically initialize their backbone models with weights pre-trained on large-scale single-modal datasets. However, this initialization carries no knowledge of fine-grained image-text cross-modal matching or semantic alignment. An easy-to-use method is therefore needed to improve cross-modal alignment in text-to-image person re-identification models.
Method: We develop a transformer network for text-to-image person re-identification that combines contrastive language-image pre-training (CLIP) with a temperature-scaled projection matching method. CLIP is a general multimodal foundation model pre-trained on large-scale image-text datasets. A vision transformer serves as the visual backbone to preserve fine-grained information, avoiding the limitations of convolutional neural networks (CNNs) in modeling long-range relationships and the information loss caused by downsampling. To further improve the cross-modal image-text alignment capability of the pre-trained CLIP model, our model focuses on fine-grained image-text semantic feature alignment using global features only.
In addition, a temperature-scaled cross-modal projection matching (TCMPM) loss function is developed for image-text cross-modal feature matching. The TCMPM loss minimizes the Kullback-Leibler (KL) divergence between the temperature-scaled projection distributions and the normalized true matching distributions in a mini-batch.
Result: Extensive experiments on two popular public datasets, CUHK person description (CUHK-PEDES) and identity-centric and fine-grained person description (ICFG-PEDES), compare the proposed method with the latest text-to-image person re-identification methods and validate its effectiveness. Rank-K (K = 1, 5, 10) is adopted as the retrieval evaluation metric. On CUHK-PEDES, Rank-1 improves by 5.92% over the best-performing existing local-level matching method and by 7.09% over existing global-level matching methods. On ICFG-PEDES, Rank-1 improves by 1.21% over existing local-level matching models. Ablation studies are also carried out on both datasets. Compared with the original CMPM loss, the TCMPM loss improves Rank-1 by 9.54% on CUHK-PEDES and by 4.67% on ICFG-PEDES. Compared with the InfoNCE loss, a commonly used loss in cross-modal contrastive learning, the TCMPM loss improves Rank-1 by 3.38% on CUHK-PEDES and by 0.42% on ICFG-PEDES.
Conclusion: An end-to-end dual transformer network is developed to learn representations of person images and descriptive texts for text-to-image person re-identification. We demonstrate that a global-level matching method has the potential to outperform current state-of-the-art local-level matching methods.
The transformer network addresses two limitations of CNNs: their inability to model long-range relationships and the loss of detailed information caused by downsampling. In addition, our method benefits from the powerful cross-modal alignment capability of CLIP; together with the proposed TCMPM loss, this allows the model to learn more discriminative image-text features.
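
The TCMPM loss described in the abstract can be illustrated with a small sketch. This is not the authors' implementation, only one plausible reading of the description. It assumes cosine similarity between L2-normalized global features, a paired mini-batch in which image i and text i share the identity label `ids[i]`, the KL direction KL(p || q) used by the original CMPM loss, and a loss summed over the image-to-text and text-to-image directions. The function names and the defaults `temperature=0.02` and `eps=1e-8` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def mean_kl_to_labels(sims, ids, temperature, eps):
    """Mean over rows of KL(p_i || q_i).

    p_i: softmax of the temperature-scaled similarities in row i.
    q_i: identity-match indicators for row i, normalized to sum to 1.
    """
    total = 0.0
    for i, row in enumerate(sims):
        p = softmax([s / temperature for s in row])
        matches = [1.0 if ids[i] == other else 0.0 for other in ids]
        n_matches = sum(matches)  # at least 1: sample i matches itself
        q = [m / n_matches for m in matches]
        for p_j, q_j in zip(p, q):
            if p_j > 0.0:  # treat 0 * log(0) as 0
                total += p_j * math.log(p_j / (q_j + eps))
    return total / len(sims)


def tcmpm_loss(image_feats, text_feats, ids, temperature=0.02, eps=1e-8):
    """Illustrative bidirectional temperature-scaled projection matching loss."""
    img = [l2_normalize(v) for v in image_feats]
    txt = [l2_normalize(v) for v in text_feats]
    # Cosine similarity of every image with every text in the mini-batch.
    i2t = [[sum(a * b for a, b in zip(x, z)) for z in txt] for x in img]
    # Transpose to obtain text-to-image similarities.
    t2i = [list(col) for col in zip(*i2t)]
    return (mean_kl_to_labels(i2t, ids, temperature, eps)
            + mean_kl_to_labels(t2i, ids, temperature, eps))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts_aligned = [[1.0, 0.0], [0.0, 1.0]]
    texts_swapped = [[0.0, 1.0], [1.0, 0.0]]
    labels = [0, 1]
    print("aligned:", tcmpm_loss(images, texts_aligned, labels))
    print("swapped:", tcmpm_loss(images, texts_swapped, labels))
```

In this toy batch the aligned pairs should give a loss close to zero, while the swapped pairs give a much larger one. A smaller temperature sharpens the projection distribution, so mismatched pairs are penalized more strongly.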

Transformer Network for Cross-Modal Text-to-image Person Re-Identification