Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID

Wentao Tan, Changxing Ding, Jiayu Jiang, Fei Wang, Yibing Zhan, Dapeng Tao
2024-07-01
Abstract: Text-to-image person re-identification (ReID) retrieves pedestrian images according to textual descriptions. Manually annotating textual descriptions is time-consuming, restricting the scale of existing datasets and therefore the generalization ability of ReID models. As a result, we study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database and directly deploy it to various datasets for evaluation. We obtain substantial training data via Multi-modal Large Language Models (MLLMs). Moreover, we identify and address two key challenges in utilizing the obtained textual descriptions. First, an MLLM tends to generate descriptions with similar structures, causing the model to overfit specific sentence patterns. Thus, we propose a novel method that uses MLLMs to caption images according to various templates. These templates are obtained using a multi-turn dialogue with a Large Language Model (LLM). Therefore, we can build a large-scale dataset with diverse textual descriptions. Second, an MLLM may produce incorrect descriptions. Hence, we introduce a novel method that automatically identifies words in a description that do not correspond with the image. This method is based on the similarity between one text and all patch token embeddings in the image. Then, we mask these words with a larger probability in the subsequent training epoch, alleviating the impact of noisy textual descriptions. The experimental results demonstrate that our methods significantly boost the direct transfer text-to-image ReID performance. Benefiting from the pre-trained model weights, we also achieve state-of-the-art performance in the traditional evaluation settings.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses text-to-image person re-identification (ReID), the task of retrieving pedestrian images from textual descriptions. Existing text-to-image ReID methods usually rely on manually annotated descriptions, which limits both the scale of the datasets and the generalization ability of the models. The authors therefore study a transferable setting: they use Multi-modal Large Language Models (MLLMs) to generate large-scale training data, train a model on it, and deploy that model directly on other datasets for evaluation.

The paper identifies two key challenges in using MLLM-generated descriptions:

1. Descriptions generated by an MLLM tend to share similar structures, so the model may overfit to specific sentence patterns and adapt poorly to the diverse description styles found in practice. To address this, the authors propose the Template Diversity Enhancement (TDE) method, which obtains diverse description templates through a multi-turn dialogue with a Large Language Model (LLM). The MLLM then captions images according to these templates, which increases the diversity of the descriptions.
2. An MLLM may generate incorrect descriptions, in which some words do not match the image. To address this, the authors propose the Noise-Aware Masking (NAM) method, which identifies mismatched words from the similarity between each text token and the patch token embeddings of the image. These words are then masked with a higher probability in the subsequent training epoch, which reduces the impact of noisy descriptions.

Experimental results show that these methods significantly improve direct-transfer text-to-image ReID performance. Starting from the resulting pre-trained weights, the model also achieves state-of-the-art performance in the traditional evaluation settings. The paper also discusses the limitations of existing methods, such as their weak generalization across datasets.
In summary, the paper addresses how to use MLLMs to automatically generate diverse text descriptions, and how to suppress the noise in those descriptions, in order to improve the cross-dataset generalization and direct transferability of text-to-image ReID models.
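
The noise-aware masking idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary only states that mismatched words are found from text-to-patch similarity and then masked with a higher probability. Everything else here is an assumption made for illustration, including the function names, the use of the maximum cosine similarity over patches as the per-word score, the linear mapping from score to masking probability, and the values of base_prob and max_prob. The toy embeddings are hand-written so the example runs without a trained model.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def token_image_scores(token_embs, patch_embs):
    """Score each text token by its best match among the image patches.

    Assumption: the maximum over patches is used as the aggregate.
    """
    return [max(cosine(t, p) for p in patch_embs) for t in token_embs]


def masking_probabilities(scores, base_prob=0.15, max_prob=0.6):
    """Map low similarity to a high masking probability.

    Assumption: a linear rescaling between base_prob and max_prob.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [base_prob] * len(scores)
    span = max_prob - base_prob
    return [base_prob + span * (hi - s) / (hi - lo) for s in scores]


def apply_mask(tokens, probs, rng, mask_token="[MASK]"):
    """Independently replace each token with mask_token with its probability."""
    return [mask_token if rng.random() < p else tok
            for tok, p in zip(tokens, probs)]


# Toy data: two image patches and four words. "umbrella" is orthogonal to
# every patch, standing in for a word the MLLM hallucinated.
tokens = ["woman", "red", "coat", "umbrella"]
patch_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
token_embs = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

scores = token_image_scores(token_embs, patch_embs)
probs = masking_probabilities(scores)
masked = apply_mask(tokens, probs, random.Random(0))
print(list(zip(tokens, probs)))
print(masked)
```

In this toy case the word with no supporting patch receives the largest masking probability, while well-grounded words stay at the base rate. In a training loop, the probabilities computed in one epoch would be applied when masking the same description in the next epoch.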