Abstract: Text-to-image person re-identification (ReID) retrieves pedestrian images according to textual descriptions. Manually annotating textual descriptions is time-consuming, restricting the scale of existing datasets and therefore the generalization ability of ReID models. As a result, we study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database and directly deploy it to various datasets for evaluation. We obtain substantial training data via Multi-modal Large Language Models (MLLMs). Moreover, we identify and address two key challenges in utilizing the obtained textual descriptions. First, an MLLM tends to generate descriptions with similar structures, causing the model to overfit specific sentence patterns. Thus, we propose a novel method that uses MLLMs to caption images according to various templates. These templates are obtained using a multi-turn dialogue with a Large Language Model (LLM). Therefore, we can build a large-scale dataset with diverse textual descriptions. Second, an MLLM may produce incorrect descriptions. Hence, we introduce a novel method that automatically identifies words in a description that do not correspond with the image. This method is based on the similarity between one text and all patch token embeddings in the image. Then, we mask these words with a larger probability in the subsequent training epoch, alleviating the impact of noisy textual descriptions. The experimental results demonstrate that our methods significantly boost the direct transfer text-to-image ReID performance. Benefiting from the pre-trained model weights, we also achieve state-of-the-art performance in the traditional evaluation settings.

What problem does this paper attempt to address?

This paper addresses text-to-image person re-identification (text-to-image ReID). Existing text-to-image ReID methods usually rely on manually annotated text descriptions, which limits the scale of the datasets and the generalization ability of the models. The authors therefore generate large-scale training data with Multi-modal Large Language Models (MLLMs), train a model on it, and deploy that model directly to other datasets for evaluation.

The paper identifies two key challenges:

1. Descriptions generated by an MLLM tend to share similar structures, so the model can overfit to specific sentence patterns and adapt poorly to the varied description styles found in practice. To address this, the authors propose Template Diversity Enhancement (TDE), which obtains diverse description templates through a multi-turn dialogue with an LLM. The MLLM then captions images according to these templates, which increases the diversity of the descriptions.
2. An MLLM may generate incorrect descriptions, in which some words do not match the image. To address this, the authors propose Noise-Aware Masking (NAM), which identifies mismatched words from the similarity between text tokens and image patch tokens. These words are then masked with a higher probability in the subsequent training epoch, which mitigates the impact of noisy descriptions.

Experimental results show that these methods significantly improve direct-transfer text-to-image ReID performance and, with the pre-trained model weights, achieve state-of-the-art performance in the traditional evaluation settings. The paper also discusses the limitations of existing methods, such as weak generalization across datasets.

In summary, this paper studies how to use MLLMs to automatically generate diverse text descriptions and how to reduce the noise in them, in order to improve the cross-dataset generalization and direct transferability of text-to-image ReID models.
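
The noise-aware masking idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the choice of the best-matching patch as a word's score, the linear mapping from score to probability, and the `base_prob` and `max_prob` values are all assumptions made for illustration. It only shows how a per-word similarity to image patch embeddings could be turned into a higher masking probability for poorly matched words.

```python
import numpy as np


def word_mask_probs(text_tokens, patch_tokens, base_prob=0.15, max_prob=0.6):
    """Give words that match the image poorly a higher masking probability.

    text_tokens:  (T, D) array, one embedding per word token.
    patch_tokens: (P, D) array, one embedding per image patch.
    base_prob / max_prob are illustrative values, not taken from the paper.
    """
    eps = 1e-12  # avoids division by zero for all-zero embeddings
    t_norm = np.maximum(np.linalg.norm(text_tokens, axis=1, keepdims=True), eps)
    p_norm = np.maximum(np.linalg.norm(patch_tokens, axis=1, keepdims=True), eps)
    t = text_tokens / t_norm
    p = patch_tokens / p_norm

    sim = t @ p.T               # (T, P) cosine similarity, word vs. patch
    score = sim.max(axis=1)     # assumption: score a word by its best patch

    # Rescale scores to [0, 1] within this description.
    span = score.max() - score.min()
    if span > 0:
        rel = (score - score.min()) / span
    else:
        rel = np.ones_like(score)  # all words equally matched

    # Low relative similarity -> probability close to max_prob.
    return base_prob + (max_prob - base_prob) * (1.0 - rel)


def apply_mask(words, probs, rng, mask_token="[MASK]"):
    """Replace each word with mask_token with its own probability."""
    draws = rng.random(len(words))
    return [mask_token if d < p else w for w, d, p in zip(words, draws, probs)]
```

In a training loop, the probabilities would be computed from the embeddings of one epoch and then passed to `apply_mask` (with a generator such as `np.random.default_rng()`) when preparing the text for the next epoch, so that suspicious words are hidden more often than well-grounded ones.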

Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID

MLLMReID: Multimodal Large Language Model-based Person Re-identification

Retrieve Anyone: A General-purpose Person Re-identification Task with Instructions

Adaptive multi-task learning for cross domain and modal person re-identification

Group-aware Label Transfer for Domain Adaptive Person Re-identification

Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification

CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification

Multi‐level Cross‐modality Learning Framework for Text‐based Person Re‐identification

Learning Comprehensive Representations with Richer Self for Text-to-Image Person Re-Identification

When Large Vision-Language Models Meet Person Re-Identification

VLUReID: Exploiting Vision-Language Knowledge for Unsupervised Person Re-Identification

Text-and-Image Learning Transformer for Cross-modal Person Re-identification

CLIP-ReID: Exploiting Vision-Language Model for Image Re-identification without Concrete Text Labels

Dynamic Textual Prompt For Rehearsal-free Lifelong Person Re-identification

MSBA: Multiple Scales, Branches and Attention Network with Bag of Tricks for Person Re-Identification

Learning Granularity-Unified Representations for Text-to-Image Person Re-identification

M2M-GAN: Many-to-Many Generative Adversarial Transfer Learning for Person Re-Identification

Learning Transferable Pedestrian Representation from Multimodal Information Supervision

Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions

Cross-modality neighbor constraints based unbalanced multi-view text–image re-identification

Text-Based Occluded Person Re-identification Via Multi-Granularity Contrastive Consistency Learning